Message ID | 20241118085357.494178-1-mjguzik@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [RFC] vfs: dodge strlen() in vfs_readlink() when ->i_link is populated | expand |
On Mon 18-11-24 09:53:57, Mateusz Guzik wrote: > This gives me about 1.5% speed up when issuing readlink on /initrd.img > on ext4. > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > --- > > I had this running with the following debug: > > if (strlen(link) != inode->i_size) > printk(KERN_CRIT "mismatch [%s] %l %l\n", link, > strlen(link), inode->i_size); > > nothing popped up Then you didn't run with UDF I guess ;). There inode->i_size is the length of on-disk symlink encoding which is definitely different from the length of the string we return to VFS (it uses weird standards-defined cross OS compatible encoding of a path and I'm not even mentioning its own special encoding of character sets somewhat resembling UCS-2). > I would leave something of that sort in if it was not defeating the > point of the change. > > However, I'm a little worried some crap fs *does not* fill this in > despite populating i_link. > > Perhaps it would make sense to keep the above with the patch hanging out > in next and remove later? > > Anyhow, worst case, should it turn out i_size does not work there are at > least two 4-byte holes which can be used to store the length (and > chances are some existing field can be converted into a union instead). I'm not sure I completely follow your proposal here... Honza
On Mon, Nov 18, 2024 at 12:53 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 18-11-24 09:53:57, Mateusz Guzik wrote: > > This gives me about 1.5% speed up when issuing readlink on /initrd.img > > on ext4. > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > --- > > > > I had this running with the following debug: > > > > if (strlen(link) != inode->i_size) > > printk(KERN_CRIT "mismatch [%s] %l %l\n", link, > > strlen(link), inode->i_size); > > > > nothing popped up > > Then you didn't run with UDF I guess ;). There inode->i_size is the length > of on-disk symlink encoding which is definitely different from the length > of the string we return to VFS (it uses weird standards-defined cross OS > compatible encoding of a path and I'm not even mentioning its own special > encoding of character sets somewhat resembling UCS-2). > Indeed I did not, thanks. :> > > I would leave something of that sort in if it was not defeating the > > point of the change. > > > > However, I'm a little worried some crap fs *does not* fill this in > > despite populating i_link. > > > > Perhaps it would make sense to keep the above with the patch hanging out > > in next and remove later? > > > > Anyhow, worst case, should it turn out i_size does not work there are at > > least two 4-byte holes which can be used to store the length (and > > chances are some existing field can be converted into a union instead). > > I'm not sure I completely follow your proposal here... > I am saying if the size has to be explicitly stored specifically for symlinks, 2 options are: - fill up one of the holes - find a field which is never looked at for symlink inodes and convert into a union I'm going to look into it.
On Mon 18-11-24 13:20:09, Mateusz Guzik wrote: > On Mon, Nov 18, 2024 at 12:53 PM Jan Kara <jack@suse.cz> wrote: > > > > On Mon 18-11-24 09:53:57, Mateusz Guzik wrote: > > > This gives me about 1.5% speed up when issuing readlink on /initrd.img > > > on ext4. > > > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > > --- ... > > > I would leave something of that sort in if it was not defeating the > > > point of the change. > > > > > > However, I'm a little worried some crap fs *does not* fill this in > > > despite populating i_link. > > > > > > Perhaps it would make sense to keep the above with the patch hanging out > > > in next and remove later? > > > > > > Anyhow, worst case, should it turn out i_size does not work there are at > > > least two 4-byte holes which can be used to store the length (and > > > chances are some existing field can be converted into a union instead). > > > > I'm not sure I completely follow your proposal here... > > > > I am saying if the size has to be explicitly stored specifically for > symlinks, 2 options are: > - fill up one of the holes > - find a field which is never looked at for symlink inodes and convert > into a union > > I'm going to look into it. I guess there's limited enthusiasm for complexity to achieve 1.5% improvement in readlink which is not *that* common. But I haven't seen the patch and other guys may have different opinions :) So we'll see. Honza
On Mon, Nov 18, 2024 at 3:41 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 18-11-24 13:20:09, Mateusz Guzik wrote: > > On Mon, Nov 18, 2024 at 12:53 PM Jan Kara <jack@suse.cz> wrote: > > > > > > On Mon 18-11-24 09:53:57, Mateusz Guzik wrote: > > > > This gives me about 1.5% speed up when issuing readlink on /initrd.img > > > > on ext4. > > > > > > > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > > > > --- > ... > > > > I would leave something of that sort in if it was not defeating the > > > > point of the change. > > > > > > > > However, I'm a little worried some crap fs *does not* fill this in > > > > despite populating i_link. > > > > > > > > Perhaps it would make sense to keep the above with the patch hanging out > > > > in next and remove later? > > > > > > > > Anyhow, worst case, should it turn out i_size does not work there are at > > > > least two 4-byte holes which can be used to store the length (and > > > > chances are some existing field can be converted into a union instead). > > > > > > I'm not sure I completely follow your proposal here... > > > > > > > I am saying if the size has to be explicitly stored specifically for > > symlinks, 2 options are: > > - fill up one of the holes > > - find a field which is never looked at for symlink inodes and convert > > into a union > > > > I'm going to look into it. > > I guess there's limited enthusiasm for complexity to achieve 1.5% improvement > in readlink which is not *that* common. But I haven't seen the patch and > other guys may have different opinions :) So we'll see. > I'm thinking an i_opflag "this inode has a cached symlink with size stored in i_linklen", so I don't think there is much in way of complexity here. Then interested filesystems could call a helper like so: diff --git a/mm/shmem.c b/mm/shmem.c index 3d17753afd94..9dedf432ae13 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3870,6 +3870,7 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir, int len; struct inode *inode; struct folio *folio; + const char *link; len = strlen(symname) + 1; if (len > PAGE_SIZE) @@ -3891,12 +3892,13 @@ static int shmem_symlink(struct mnt_idmap *idmap, struct inode *dir, inode->i_size = len-1; if (len <= SHORT_SYMLINK_LEN) { - inode->i_link = kmemdup(symname, len, GFP_KERNEL); - if (!inode->i_link) { + link= kmemdup(symname, len, GFP_KERNEL); + if (!link) { error = -ENOMEM; goto out_remove_offset; } inode->i_op = &shmem_short_symlink_operations; + inode_set_cached_link(link, len); } else { inode_nohighmem(inode); inode->i_mapping->a_ops = &shmem_aops; This is only 1.5% because of other weird slowdowns which don't need to be there, notably putname using atomics. If the other crap was already fixed it would be closer to 5%. Here comes a funny note that readlink is used to implement realpath and vast majority of calls are for directories(!), for which the patch is a nop. However, actual readlinks on symlinks do happen every time you run gcc for example: readlink("/usr/bin/cc", "/etc/alternatives/cc", 1023) = 20 readlink("/etc/alternatives/cc", "/usr/bin/gcc", 1023) = 12 readlink("/usr/bin/gcc", "gcc-12", 1023) = 6 readlink("/usr/bin/gcc-12", "x86_64-linux-gnu-gcc-12", 1023) = 23 readlink("/usr/bin/cc", "/etc/alternatives/cc", 1023) = 20 readlink("/etc/alternatives/cc", "/usr/bin/gcc", 1023) = 12 readlink("/usr/bin/gcc", "gcc-12", 1023) = 6 readlink("/usr/bin/gcc-12", "x86_64-linux-gnu-gcc-12", 1023) = 23 that said, no, this is not earth shattering by any means but I don't see any reason to *object* to it
On Mon, Nov 18, 2024 at 03:51:34PM +0100, Mateusz Guzik wrote: > This is only 1.5% because of other weird slowdowns which don't need to > be there, notably putname using atomics. If the other crap was already > fixed it would be closer to 5%. Describe your plans re putname(), please. Because we are pretty much certain to step on each other's toes here in the coming cycle.
On Mon, Nov 18, 2024 at 4:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Mon, Nov 18, 2024 at 03:51:34PM +0100, Mateusz Guzik wrote: > > > This is only 1.5% because of other weird slowdowns which don't need to > > be there, notably putname using atomics. If the other crap was already > > fixed it would be closer to 5%. > > Describe your plans re putname(), please. Because we are pretty much > certain to step on each other's toes here in the coming cycle. I don't have immediate plans for putname, but I posted a total hack some time ago in relation to it: https://lore.kernel.org/all/20240604132448.101183-1-mjguzik@gmail.com/ along with this: > The race is only possible with io_uring which has a dedicated entry > point, thus a getname variant which takes it into account could store > the need to use atomics as a flag in struct filename. To that end > getname could take a boolean indicating this, fronted with some inlines > and the current entry point renamed to __getname_flags to hide it. > Option B is to add a routine which "upgrades" to atomics after getname > returns, but that's a littly fishy vs audit_reusename. > At the end of the day all spots which modify the ref could branch on the > atomics flag. I ended up not getting this done because there was something real off about name caching for audit, I don't remember the details but I was not confident the code is correct as is. Anyhow, someone else sorting this out is most welcome. Apart from the few things I posted I have no immediate plans to mess with anything vfs (I do have some plans to reduce the cost of memcg though).
> Apart from the few things I posted I have no immediate plans to mess > with anything vfs (I do have some plans to reduce the cost of memcg > though). Woh woh woh, you're our new perf specialist. You can't just wander off. ;)
diff --git a/fs/namei.c b/fs/namei.c index 9d30c7aa9aa6..7aced8aca0f6 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -5272,19 +5272,16 @@ SYSCALL_DEFINE2(rename, const char __user *, oldname, const char __user *, newna getname(newname), 0); } -int readlink_copy(char __user *buffer, int buflen, const char *link) +int readlink_copy(char __user *buffer, int buflen, const char *link, int linklen) { - int len = PTR_ERR(link); - if (IS_ERR(link)) - goto out; + int copylen; - len = strlen(link); - if (len > (unsigned) buflen) - len = buflen; - if (copy_to_user(buffer, link, len)) - len = -EFAULT; -out: - return len; + copylen = linklen; + if (unlikely(copylen > (unsigned) buflen)) + copylen = buflen; + if (copy_to_user(buffer, link, copylen)) + copylen = -EFAULT; + return copylen; } /** @@ -5317,13 +5314,15 @@ int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen) } link = READ_ONCE(inode->i_link); - if (!link) { - link = inode->i_op->get_link(dentry, inode, &done); - if (IS_ERR(link)) - return PTR_ERR(link); + if (link) + return readlink_copy(buffer, buflen, link, inode->i_size); + + link = inode->i_op->get_link(dentry, inode, &done); + res = PTR_ERR(link); + if (!IS_ERR(link)) { + res = readlink_copy(buffer, buflen, link, strlen(link)); + do_delayed_call(&done); } - res = readlink_copy(buffer, buflen, link); - do_delayed_call(&done); return res; } EXPORT_SYMBOL(vfs_readlink); @@ -5391,10 +5390,14 @@ EXPORT_SYMBOL(page_put_link); int page_readlink(struct dentry *dentry, char __user *buffer, int buflen) { + const char *link; + int res; + DEFINE_DELAYED_CALL(done); - int res = readlink_copy(buffer, buflen, - page_get_link(dentry, d_inode(dentry), - &done)); + link = page_get_link(dentry, d_inode(dentry), &done); + res = PTR_ERR(link); + if (!IS_ERR(link)) + res = readlink_copy(buffer, buflen, link, strlen(link)); do_delayed_call(&done); return res; } diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c index 8e159fc78c0a..c610224faf10 100644 --- a/fs/proc/namespaces.c +++ b/fs/proc/namespaces.c @@ -83,7 +83,7 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) { res = ns_get_name(name, sizeof(name), task, ns_ops); if (res >= 0) - res = readlink_copy(buffer, buflen, name); + res = readlink_copy(buffer, buflen, name, strlen(name)); } put_task_struct(task); return res; diff --git a/include/linux/fs.h b/include/linux/fs.h index 972147da71f9..7d456db6a381 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3351,7 +3351,7 @@ extern const struct file_operations generic_ro_fops; #define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m)) -extern int readlink_copy(char __user *, int, const char *); +extern int readlink_copy(char __user *, int, const char *, int); extern int page_readlink(struct dentry *, char __user *, int); extern const char *page_get_link(struct dentry *, struct inode *, struct delayed_call *); diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c index 01b923d97a44..60959cfba672 100644 --- a/security/apparmor/apparmorfs.c +++ b/security/apparmor/apparmorfs.c @@ -2611,7 +2611,7 @@ static int policy_readlink(struct dentry *dentry, char __user *buffer, res = snprintf(name, sizeof(name), "%s:[%lu]", AAFS_NAME, d_inode(dentry)->i_ino); if (res > 0 && res < sizeof(name)) - res = readlink_copy(buffer, buflen, name); + res = readlink_copy(buffer, buflen, name, strlen(name)); else res = -ENOENT;
This gives me about 1.5% speed up when issuing readlink on /initrd.img on ext4. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- I had this running with the following debug: if (strlen(link) != inode->i_size) printk(KERN_CRIT "mismatch [%s] %l %l\n", link, strlen(link), inode->i_size); nothing popped up I would leave something of that sort in if it was not defeating the point of the change. However, I'm a little worried some crap fs *does not* fill this in despite populating i_link. Perhaps it would make sense to keep the above with the patch hanging out in next and remove later? Anyhow, worst case, should it turn out i_size does not work there are at least two 4-byte holes which can be used to store the length (and chances are some existing field can be converted into a union instead). Bench: $ cat tests/readlink1.c #include <stdlib.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> #include <string.h> char *testcase_description = "readlink /initrd.img"; void testcase(unsigned long long *iterations, unsigned long nr) { char *tmplink = "/initrd.img"; char buf[1024]; while (1) { int error = readlink(tmplink, buf, sizeof(buf)); assert(error > 0); (*iterations)++; } } fs/namei.c | 43 ++++++++++++++++++---------------- fs/proc/namespaces.c | 2 +- include/linux/fs.h | 2 +- security/apparmor/apparmorfs.c | 2 +- 4 files changed, 26 insertions(+), 23 deletions(-)