
[v2,1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd

Message ID 20181009222042.9781-1-joel@joelfernandes.org (mailing list archive)
State New, archived
Series [v2,1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd

Commit Message

Joel Fernandes Oct. 9, 2018, 10:20 p.m. UTC
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at any time.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then drop its protection for "future" writes
while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
in action:

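#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* memfd_create_region() and REGION_SIZE are not part of this listing. */

int main() {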
    int ret, fd;
    void *addr, *addr2, *addr3, *addr1;
    ret = memfd_create_region("test_region", REGION_SIZE);
    printf("ret=%d\n", ret);
    fd = ret;

    // Create map
    addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
	    printf("map 0 failed\n");
    else
	    printf("map 0 passed\n");

    if ((ret = write(fd, "test", 4)) != 4)
	    printf("write failed even though no fs-write seal "
		   "(ret=%d errno =%d)\n", ret, errno);
    else
	    printf("write passed\n");

    addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr1 == MAP_FAILED)
	    perror("map 1 prot-write failed even though no seal\n");
    else
	    printf("map 1 prot-write passed as expected\n");

    ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE);
    if (ret == -1)
	    printf("fcntl failed, errno: %d\n", errno);
    else
	    printf("fs-write seal now active\n");

    if ((ret = write(fd, "test", 4)) != 4)
	    printf("write failed as expected due to fs-write seal\n");
    else
	    printf("write passed (unexpected)\n");

    addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr2 == MAP_FAILED)
	    perror("map 2 prot-write failed as expected due to seal\n");
    else
	    printf("map 2 passed\n");

    addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (addr3 == MAP_FAILED)
	    perror("map 3 failed\n");
    else
	    printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
fs-write seal now active
write failed as expected due to fs-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Note: This seal will also prevent growing and shrinking of the memfd.
This is not something we do in Android so it does not affect us, however
I have mentioned this behavior of the seal in the manpage.
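
The test program above calls memfd_create_region() and uses REGION_SIZE
without showing either. A minimal sketch of what such a helper could look
like follows. The body, the 4096-byte REGION_SIZE and the use of
MFD_CLOEXEC are assumptions made for illustration, not taken from the
patch; the only firm requirement is MFD_ALLOW_SEALING, without which
F_ADD_SEALS is refused.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096 /* assumed size; the original value is not shown */

/*
 * Hypothetical helper: create a sealable memfd and give it a size.
 * Returns the new fd, or -1 on failure with errno preserved.
 */
static int memfd_create_region(const char *name, size_t size)
{
    int fd;

    assert(name != NULL);

    /* Without MFD_ALLOW_SEALING the fd starts out with F_SEAL_SEAL set. */
    fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    /* A fresh memfd is empty; size it before it is mapped. */
    if (ftruncate(fd, (off_t)size) < 0) {
        int saved = errno;

        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

The sizing has to happen before any seal that blocks resizing is added,
which is why the helper calls ftruncate() right after creation.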

Cc: jreck@google.com
Cc: john.stultz@linaro.org
Cc: tkjos@google.com
Cc: gregkh@linuxfoundation.org
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.

 include/uapi/linux/fcntl.h | 1 +
 mm/memfd.c                 | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

John Stultz Oct. 16, 2018, 9:57 p.m. UTC | #1
On Tue, Oct 9, 2018 at 3:20 PM, Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
> Android uses ashmem for sharing memory regions. We are looking forward
> to migrating all usecases of ashmem to memfd so that we can possibly
> remove the ashmem driver in the future from staging while also
> benefiting from using memfd and contributing to it. Note staging drivers
> are also not ABI and generally can be removed at anytime.
>
> One of the main usecases Android has is the ability to create a region
> and mmap it as writeable, then drop its protection for "future" writes
> while keeping the existing already mmap'ed writeable-region active.
> This allows us to implement a usecase where receivers of the shared
> memory buffer can get a read-only view, while the sender continues to
> write to the buffer. See CursorWindow in Android for more details:
> https://developer.android.com/reference/android/database/CursorWindow
>
> This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
> prevents any future mmap and write syscalls from succeeding while
> keeping the existing mmap active. The following program shows the seal
> working in action:
>
> int main() {
>     int ret, fd;
>     void *addr, *addr2, *addr3, *addr1;
>     ret = memfd_create_region("test_region", REGION_SIZE);
>     printf("ret=%d\n", ret);
>     fd = ret;
>
>     // Create map
>     addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>     if (addr == MAP_FAILED)
>             printf("map 0 failed\n");
>     else
>             printf("map 0 passed\n");
>
>     if ((ret = write(fd, "test", 4)) != 4)
>             printf("write failed even though no fs-write seal "
>                    "(ret=%d errno =%d)\n", ret, errno);
>     else
>             printf("write passed\n");
>
>     addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>     if (addr1 == MAP_FAILED)
>             perror("map 1 prot-write failed even though no seal\n");
>     else
>             printf("map 1 prot-write passed as expected\n");
>
>     ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE);
>     if (ret == -1)
>             printf("fcntl failed, errno: %d\n", errno);
>     else
>             printf("fs-write seal now active\n");
>
>     if ((ret = write(fd, "test", 4)) != 4)
>             printf("write failed as expected due to fs-write seal\n");
>     else
>             printf("write passed (unexpected)\n");
>
>     addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
>     if (addr2 == MAP_FAILED)
>             perror("map 2 prot-write failed as expected due to seal\n");
>     else
>             printf("map 2 passed\n");
>
>     addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
>     if (addr3 == MAP_FAILED)
>             perror("map 3 failed\n");
>     else
>             printf("map 3 prot-read passed as expected\n");
> }
>
> The output of running this program is as follows:
> ret=3
> map 0 passed
> write passed
> map 1 prot-write passed as expected
> fs-write seal now active
> write failed as expected due to fs-write seal
> map 2 prot-write failed as expected due to seal
> : Permission denied
> map 3 prot-read passed as expected
>
> Note: This seal will also prevent growing and shrinking of the memfd.
> This is not something we do in Android so it does not affect us, however
> I have mentioned this behavior of the seal in the manpage.
>
> Cc: jreck@google.com
> Cc: john.stultz@linaro.org
> Cc: tkjos@google.com
> Cc: gregkh@linuxfoundation.org
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Reviewed-by: John Stultz <john.stultz@linaro.org>

thanks
-john

Christoph Hellwig Oct. 17, 2018, 9:51 a.m. UTC | #2
On Tue, Oct 09, 2018 at 03:20:41PM -0700, Joel Fernandes (Google) wrote:
> One of the main usecases Android has is the ability to create a region
> and mmap it as writeable, then drop its protection for "future" writes
> while keeping the existing already mmap'ed writeable-region active.

s/drop/add/ ?

Otherwise this doesn't make much sense to me.

> This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
> prevents any future mmap and write syscalls from succeeding while
> keeping the existing mmap active. The following program shows the seal
> working in action:

Where does the FS come from?  I'd rather expect this to be implemented
as a 'force' style flag that applies the seal even if the otherwise
required precondition is not met.

> Note: This seal will also prevent growing and shrinking of the memfd.
> This is not something we do in Android so it does not affect us, however
> I have mentioned this behavior of the seal in the manpage.

This seems odd, as that is otherwise split into the F_SEAL_SHRINK /
F_SEAL_GROW flags.

>  static int memfd_add_seals(struct file *file, unsigned int seals)
>  {
> @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
>  		}
>  	}
>  
> +	if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
> +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> +

This seems to lack any synchronization for f_mode.

Joel Fernandes Oct. 17, 2018, 10:39 a.m. UTC | #3
On Wed, Oct 17, 2018 at 02:51:55AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 09, 2018 at 03:20:41PM -0700, Joel Fernandes (Google) wrote:
> > One of the main usecases Android has is the ability to create a region
> > and mmap it as writeable, then drop its protection for "future" writes
> > while keeping the existing already mmap'ed writeable-region active.
> 
> s/drop/add/ ?
> 
> Otherwise this doesn't make much sense to me.

Sure, you are right that "add" is more appropriate. I'll change it to that.

> > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
> > prevents any future mmap and write syscalls from succeeding while
> > keeping the existing mmap active. The following program shows the seal
> > working in action:
> 
> Where does the FS come from?  I'd rather expect this to be implemented
> as a 'force' style flag that applies the seal even if the otherwise
> required precondition is not met.

The "FS" was meant to convey that the seal is preventing writes at the VFS
layer itself, for example vfs_write checks FMODE_WRITE and does not proceed,
it instead returns an error if the flag is not set. I could not find a better
name for it, I could call it F_SEAL_VFS_WRITE if you prefer?

> > Note: This seal will also prevent growing and shrinking of the memfd.
> > This is not something we do in Android so it does not affect us, however
> > I have mentioned this behavior of the seal in the manpage.
> 
> This seems odd, as that is otherwise split into the F_SEAL_SHRINK /
> F_SEAL_GROW flags.

I could make it such that this seal would not be allowed unless F_SEAL_SHRINK
and F_SEAL_GROW are either previously set, or they are passed along with this
seal. Would that make more sense to you?
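
The rule proposed in the paragraph above can be sketched as a small
predicate. This is only an illustration: the function name and the SEAL_*
macro names are hypothetical, and the bit values are copied from the
fcntl.h hunk of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as they appear in the patch's fcntl.h hunk. */
#define SEAL_SHRINK   0x0002u
#define SEAL_GROW     0x0004u
#define SEAL_FS_WRITE 0x0010u /* value proposed by this patch */

static_assert((SEAL_FS_WRITE & (SEAL_SHRINK | SEAL_GROW)) == 0,
              "seal bits must not overlap");

/*
 * Hypothetical check: may `requested` be added on top of `existing`?
 * The new seal is accepted only if both size seals are already set or
 * are being added in the same call.
 */
static bool fs_write_seal_allowed(unsigned int existing,
                                  unsigned int requested)
{
    const unsigned int size_seals = SEAL_SHRINK | SEAL_GROW;

    if (!(requested & SEAL_FS_WRITE))
        return true; /* the rule only concerns the new seal */

    return ((existing | requested) & size_seals) == size_seals;
}
```

Under this sketch, adding the new seal alone to an unsealed file is
rejected, while passing it together with both size seals is accepted.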

> >  static int memfd_add_seals(struct file *file, unsigned int seals)
> >  {
> > @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> >  		}
> >  	}
> >  
> > +	if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
> > +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > +
> 
> This seems to lack any synchronization for f_mode.

The f_mode is set when the struct file is first created and then memfd sets
additional flags in memfd_create. Then later we are changing it here at the
time of setting the seal. I do not see any possibility of a race since it is
impossible to set the seal before memfd_create returns. Could you provide
more details about what kind of synchronization is needed and what is the
race condition scenario you were thinking of?

thanks for the review,

 - Joel

Christoph Hellwig Oct. 17, 2018, 12:08 p.m. UTC | #4
On Wed, Oct 17, 2018 at 03:39:58AM -0700, Joel Fernandes wrote:
> > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > > To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
> > > prevents any future mmap and write syscalls from succeeding while
> > > keeping the existing mmap active. The following program shows the seal
> > > working in action:
> > 
> > Where does the FS come from?  I'd rather expect this to be implemented
> > as a 'force' style flag that applies the seal even if the otherwise
> > required precondition is not met.
> 
> The "FS" was meant to convey that the seal is preventing writes at the VFS
> layer itself, for example vfs_write checks FMODE_WRITE and does not proceed,
> it instead returns an error if the flag is not set. I could not find a better
> name for it, I could call it F_SEAL_VFS_WRITE if you prefer?

I don't think there is anything VFS or FS about that - at best that
is an implementation detail.

Either do something like the force flag I suggested in the last mail,
or give it a name that matches the intention, e.g F_SEAL_FUTURE_WRITE.

> I could make it such that this seal would not be allowed unless F_SEAL_SHRINK
> and F_SEAL_GROW are either previously set, or they are passed along with this
> seal. Would that make more sense to you?

Yes.

> > >  static int memfd_add_seals(struct file *file, unsigned int seals)
> > >  {
> > > @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> > >  		}
> > >  	}
> > >  
> > > +	if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
> > > +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > +
> > 
> > This seems to lack any synchronization for f_mode.
> 
> The f_mode is set when the struct file is first created and then memfd sets
> additional flags in memfd_create. Then later we are changing it here at the
> time of setting the seal. I do not see any possibility of a race since it is
> impossible to set the seal before memfd_create returns. Could you provide
> more details about what kind of synchronization is needed and what is the
> race condition scenario you were thinking of?

Even if no one changes these specific flags we still need a lock due
to rmw cycles on the field.  For example fadvise can set or clear
FMODE_RANDOM.  It seems to use file->f_lock for synchronization.

Daniel Colascione Oct. 17, 2018, 3:44 p.m. UTC | #5
On Wed, Oct 17, 2018 at 5:08 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Oct 17, 2018 at 03:39:58AM -0700, Joel Fernandes wrote:
>> > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
>> > > To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
>> > > prevents any future mmap and write syscalls from succeeding while
>> > > keeping the existing mmap active. The following program shows the seal
>> > > working in action:
>> >
>> > Where does the FS come from?  I'd rather expect this to be implemented
>> > as a 'force' style flag that applies the seal even if the otherwise
>> > required precondition is not met.
>>
>> The "FS" was meant to convey that the seal is preventing writes at the VFS
>> layer itself, for example vfs_write checks FMODE_WRITE and does not proceed,
>> it instead returns an error if the flag is not set. I could not find a better
>> name for it, I could call it F_SEAL_VFS_WRITE if you prefer?
>
> I don't think there is anything VFS or FS about that - at best that
> is an implementation detail.
>
> Either do something like the force flag I suggested in the last mail,
> or give it a name that matches the intention, e.g F_SEAL_FUTURE_WRITE.

+1

>> > This seems to lack any synchronization for f_mode.
>>
>> The f_mode is set when the struct file is first created and then memfd sets
>> additional flags in memfd_create. Then later we are changing it here at the
>> time of setting the seal. I do not see any possibility of a race since it is
>> impossible to set the seal before memfd_create returns. Could you provide
>> more details about what kind of synchronization is needed and what is the
>> race condition scenario you were thinking of?
>
> Even if no one changes these specific flags we still need a lock due
> to rmw cycles on the field.  For example fadvise can set or clear
> FMODE_RANDOM.  It seems to use file->f_lock for synchronization.

Compare-and-exchange will suffice, right?

Christoph Hellwig Oct. 17, 2018, 4:19 p.m. UTC | #6
On Wed, Oct 17, 2018 at 08:44:01AM -0700, Daniel Colascione wrote:
> > Even if no one changes these specific flags we still need a lock due
> > to rmw cycles on the field.  For example fadvise can set or clear
> > FMODE_RANDOM.  It seems to use file->f_lock for synchronization.
> 
> Compare-and-exchange will suffice, right?

Only if all users use the compare and exchange, and right now they
don't.
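
For reference, a compare-and-exchange update of a flags word looks roughly
like the userspace sketch below, written with C11 atomics. The MODE_*
values are arbitrary stand-ins rather than the kernel's FMODE_*
definitions, and the function name is hypothetical. The loop is only safe
because the word is declared _Atomic, so every access goes through atomic
operations; one writer doing a plain read-modify-write on the same field
would defeat it, which is the objection raised above.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in flag bits for illustration; not the kernel's FMODE_* values. */
#define MODE_WRITE  0x01u
#define MODE_PWRITE 0x02u
#define MODE_RANDOM 0x04u

static_assert((MODE_RANDOM & (MODE_WRITE | MODE_PWRITE)) == 0,
              "flag bits must not overlap");

/*
 * Hypothetical helper: atomically clear `bits` in `*mode` and return the
 * value that was stored.
 */
static unsigned int mode_clear_bits_cas(_Atomic unsigned int *mode,
                                        unsigned int bits)
{
    unsigned int old = atomic_load(mode);

    /* On failure `old` is reloaded with the current value; retry. */
    while (!atomic_compare_exchange_weak(mode, &old, old & ~bits))
        ;

    return old & ~bits;
}
```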

Joel Fernandes Oct. 17, 2018, 5:45 p.m. UTC | #7
On Wed, Oct 17, 2018 at 05:08:29AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 17, 2018 at 03:39:58AM -0700, Joel Fernandes wrote:
> > > > This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
> > > > To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
> > > > prevents any future mmap and write syscalls from succeeding while
> > > > keeping the existing mmap active. The following program shows the seal
> > > > working in action:
> > > 
> > > Where does the FS come from?  I'd rather expect this to be implemented
> > > as a 'force' style flag that applies the seal even if the otherwise
> > > required precondition is not met.
> > 
> > The "FS" was meant to convey that the seal is preventing writes at the VFS
> > layer itself, for example vfs_write checks FMODE_WRITE and does not proceed,
> > it instead returns an error if the flag is not set. I could not find a better
> > name for it, I could call it F_SEAL_VFS_WRITE if you prefer?
> 
> I don't think there is anything VFS or FS about that - at best that
> is an implementation detail.
> 
> Either do something like the force flag I suggested in the last mail,
> or give it a name that matches the intention, e.g F_SEAL_FUTURE_WRITE.
> 

Ok, I agree. I like the name F_SEAL_FUTURE_WRITE you are proposing so I will
use that.

> > I could make it such that this seal would not be allowed unless F_SEAL_SHRINK
> > and F_SEAL_GROW are either previously set, or they are passed along with this
> > seal. Would that make more sense to you?
> 
> Yes.

Cool.

> > > >  static int memfd_add_seals(struct file *file, unsigned int seals)
> > > >  {
> > > > @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
> > > >  		}
> > > >  	}
> > > >  
> > > > +	if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
> > > > +		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
> > > > +
> > > 
> > > This seems to lack any synchronization for f_mode.
> > 
> > The f_mode is set when the struct file is first created and then memfd sets
> > additional flags in memfd_create. Then later we are changing it here at the
> > time of setting the seal. I do not see any possibility of a race since it is
> > impossible to set the seal before memfd_create returns. Could you provide
> > more details about what kind of synchronization is needed and what is the
> > race condition scenario you were thinking of?
> 
> Even if no one changes these specific flags we still need a lock due
> to rmw cycles on the field.  For example fadvise can set or clear
> FMODE_RANDOM.  It seems to use file->f_lock for synchronization.

Ok, I will acquire the f_lock before setting these, thanks for the
explanation. Will post updated patches today.
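
The locking pattern agreed on here can be illustrated in userspace with a
mutex standing in for file->f_lock. This is an analogy, not the kernel
change itself: struct fake_file, its helpers and the MODE_* values are all
hypothetical. The point it shows is that every read-modify-write of the
shared word, whichever bits it touches, takes the same lock.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in flag bits for illustration; not the kernel's FMODE_* values. */
#define MODE_WRITE  0x01u
#define MODE_PWRITE 0x02u
#define MODE_RANDOM 0x04u

/* Userspace stand-in for the two struct file members discussed above. */
struct fake_file {
    pthread_mutex_t f_lock;
    unsigned int f_mode;
};

static void fake_file_init(struct fake_file *f, unsigned int mode)
{
    assert(f != NULL);
    pthread_mutex_init(&f->f_lock, NULL);
    f->f_mode = mode;
}

/* Set bits under the lock, as a writer such as fadvise would. */
static void fake_file_set_mode(struct fake_file *f, unsigned int bits)
{
    pthread_mutex_lock(&f->f_lock);
    f->f_mode |= bits;
    pthread_mutex_unlock(&f->f_lock);
}

/* Clear bits under the same lock, as the sealing path would. */
static void fake_file_clear_mode(struct fake_file *f, unsigned int bits)
{
    pthread_mutex_lock(&f->f_lock);
    f->f_mode &= ~bits;
    pthread_mutex_unlock(&f->f_lock);
}
```

Because both helpers serialize on f_lock, an update to one bit cannot
overwrite a concurrent update to another bit in the same word.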

 - Joel

Patch

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index c98312fa78a5..fe44a2035edf 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7 @@ 
 #define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
 #define F_SEAL_GROW	0x0004	/* prevent file from growing */
 #define F_SEAL_WRITE	0x0008	/* prevent writes */
+#define F_SEAL_FS_WRITE	0x0010  /* prevent all write-related syscalls */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 27069518e3c5..9b8855b80de9 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@  static unsigned int *memfd_file_seals_ptr(struct file *file)
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 		     F_SEAL_SHRINK | \
 		     F_SEAL_GROW | \
-		     F_SEAL_WRITE)
+		     F_SEAL_WRITE | \
+		     F_SEAL_FS_WRITE)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
@@ -219,6 +220,9 @@  static int memfd_add_seals(struct file *file, unsigned int seals)
 		}
 	}
 
+	if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
+		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
+
 	*file_seals |= seals;
 	error = 0;