Message ID | 20190730014924.2193-10-deepa.kernel@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfs: Add support for timestamp limits | expand |
On Mon, Jul 29, 2019 at 06:49:13PM -0700, Deepa Dinamani wrote: > ext4 has different overflow limits for max filesystem > timestamps based on the extra bytes available. > > The timestamp limits are calculated according to the > encoding table in > a4dad1ae24f85i(ext4: Fix handling of extended tv_sec): > > * extra msb of adjust for signed > * epoch 32-bit 32-bit tv_sec to > * bits time decoded 64-bit tv_sec 64-bit tv_sec valid time range > * 0 0 1 -0x80000000..-0x00000001 0x000000000 1901-12-13..1969-12-31 > * 0 0 0 0x000000000..0x07fffffff 0x000000000 1970-01-01..2038-01-19 > * 0 1 1 0x080000000..0x0ffffffff 0x100000000 2038-01-19..2106-02-07 > * 0 1 0 0x100000000..0x17fffffff 0x100000000 2106-02-07..2174-02-25 > * 1 0 1 0x180000000..0x1ffffffff 0x200000000 2174-02-25..2242-03-16 > * 1 0 0 0x200000000..0x27fffffff 0x200000000 2242-03-16..2310-04-04 > * 1 1 1 0x280000000..0x2ffffffff 0x300000000 2310-04-04..2378-04-22 > * 1 1 0 0x300000000..0x37fffffff 0x300000000 2378-04-22..2446-05-10 My recollection of ext4 has gotten rusty, so this could be a bogus question: Say you have a filesystem with s_inode_size > 128 where not all of the ondisk inodes have been upgraded to i_extra_isize > 0 and therefore don't support nanoseconds or times beyond 2038. I think this happens on ext3 filesystems that reserved extra space for inode attrs that are subsequently converted to ext4? In any case, that means that you have some inodes that support 34-bit tv_sec and some inodes that only support 32-bit tv_sec. For the inodes with 32-bit tv_sec, I think all that happens is that a large timestamp will be truncated further, right? And no mount time warning because at least /some/ of the inodes are ready to go for the next 30 years? --D > Note that the time limits are not correct for deletion times. > > Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> > Reviewed-by: Andreas Dilger <adilger@dilger.ca> > Cc: tytso@mit.edu > Cc: adilger.kernel@dilger.ca > Cc: linux-ext4@vger.kernel.org > --- > fs/ext4/ext4.h | 4 ++++ > fs/ext4/super.c | 17 +++++++++++++++-- > 2 files changed, 19 insertions(+), 2 deletions(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 1cb67859e051..3f13cf12ae7f 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -1631,6 +1631,10 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei) > > #define EXT4_GOOD_OLD_INODE_SIZE 128 > > +#define EXT4_EXTRA_TIMESTAMP_MAX (((s64)1 << 34) - 1 + S32_MIN) > +#define EXT4_NON_EXTRA_TIMESTAMP_MAX S32_MAX > +#define EXT4_TIMESTAMP_MIN S32_MIN > + > /* > * Feature set definitions > */ > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index 4079605d437a..3ea2d60f33aa 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -4035,8 +4035,21 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > sbi->s_inode_size); > goto failed_mount; > } > - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) > - sb->s_time_gran = 1 << (EXT4_EPOCH_BITS - 2); > + /* > + * i_atime_extra is the last extra field available for [acm]times in > + * struct ext4_inode. Checking for that field should suffice to ensure > + * we have extra space for all three. > + */ > + if (sbi->s_inode_size >= offsetof(struct ext4_inode, i_atime_extra) + > + sizeof(((struct ext4_inode *)0)->i_atime_extra)) { > + sb->s_time_gran = 1; > + sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX; > + } else { > + sb->s_time_gran = NSEC_PER_SEC; > + sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX; > + } > + > + sb->s_time_min = EXT4_TIMESTAMP_MIN; > } > > sbi->s_desc_size = le16_to_cpu(es->s_desc_size); > -- > 2.17.1 >
On Wed, Jul 31, 2019 at 8:26 AM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Mon, Jul 29, 2019 at 06:49:13PM -0700, Deepa Dinamani wrote: > > ext4 has different overflow limits for max filesystem > > timestamps based on the extra bytes available. > > > > The timestamp limits are calculated according to the > > encoding table in > > a4dad1ae24f85i(ext4: Fix handling of extended tv_sec): > > > > * extra msb of adjust for signed > > * epoch 32-bit 32-bit tv_sec to > > * bits time decoded 64-bit tv_sec 64-bit tv_sec valid time range > > * 0 0 1 -0x80000000..-0x00000001 0x000000000 1901-12-13..1969-12-31 > > * 0 0 0 0x000000000..0x07fffffff 0x000000000 1970-01-01..2038-01-19 > > * 0 1 1 0x080000000..0x0ffffffff 0x100000000 2038-01-19..2106-02-07 > > * 0 1 0 0x100000000..0x17fffffff 0x100000000 2106-02-07..2174-02-25 > > * 1 0 1 0x180000000..0x1ffffffff 0x200000000 2174-02-25..2242-03-16 > > * 1 0 0 0x200000000..0x27fffffff 0x200000000 2242-03-16..2310-04-04 > > * 1 1 1 0x280000000..0x2ffffffff 0x300000000 2310-04-04..2378-04-22 > > * 1 1 0 0x300000000..0x37fffffff 0x300000000 2378-04-22..2446-05-10 > > My recollection of ext4 has gotten rusty, so this could be a bogus > question: > > Say you have a filesystem with s_inode_size > 128 where not all of the > ondisk inodes have been upgraded to i_extra_isize > 0 and therefore > don't support nanoseconds or times beyond 2038. I think this happens on > ext3 filesystems that reserved extra space for inode attrs that are > subsequently converted to ext4? > > In any case, that means that you have some inodes that support 34-bit > tv_sec and some inodes that only support 32-bit tv_sec. For the inodes > with 32-bit tv_sec, I think all that happens is that a large timestamp > will be truncated further, right? > > And no mount time warning because at least /some/ of the inodes are > ready to go for the next 30 years? I'm confused about ext3 being converted to ext4. If the converted inodes have extra space, then ext4_iget() will start using the extra space when it modifies the on disk inode, won't it? But, if there are 32 bit tv_sec and 34 bit tv_sec in a superblock then from this macro below, if an inode has space for extra bits in timestamps, it uses it. Otherwise, only the first 32 bits are copied to the on disk timestamp. This matches the behavior today for 32 bit tv_sec. But, the 34 bit tv_sec has a better behavior after the series because of the clamping and warning. #define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \ do { \ (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec); \ if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) {\ (raw_inode)->xtime ## _extra = \ ext4_encode_extra_time(&(inode)->xtime); \ } \ } while (0) I'm not sure if this corner case if important. Maybe the maintainers can help me here. If this is important, then the inode time updates for an ext4 inode should always be through ext4_setattr() and we can clamp the timestamps there as a special case. And, this patch can be added separately? Thanks, Deepa
On Thu, Aug 01, 2019 at 12:18:28PM -0700, Deepa Dinamani wrote: > > Say you have a filesystem with s_inode_size > 128 where not all of the > > ondisk inodes have been upgraded to i_extra_isize > 0 and therefore > > don't support nanoseconds or times beyond 2038. I think this happens on > > ext3 filesystems that reserved extra space for inode attrs that are > > subsequently converted to ext4? > > I'm confused about ext3 being converted to ext4. If the converted > inodes have extra space, then ext4_iget() will start using the extra > space when it modifies the on disk inode, won't it?i It is possible that you can have an ext3 file system with (for example) 256 byte inodes, and all of the extra space was used for extended attributes, then ext4 won't have the extra space available. This is going toh be on an inode-by-inode basis, and if an extended attribute is motdified or deleted, the space would become available,t and then inode would start getting a higher resolution timestamp. I really don't think it's worth worrying about that, though. It's highly unlikely ext3 file systems will be still be in service by the time it's needed in 2038. And if so, it's highly unlikely they would be converted to ext4. - Ted
On Fri, Aug 2, 2019 at 12:43 AM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Thu, Aug 01, 2019 at 12:18:28PM -0700, Deepa Dinamani wrote: > > > Say you have a filesystem with s_inode_size > 128 where not all of the > > > ondisk inodes have been upgraded to i_extra_isize > 0 and therefore > > > don't support nanoseconds or times beyond 2038. I think this happens on > > > ext3 filesystems that reserved extra space for inode attrs that are > > > subsequently converted to ext4? > > > > I'm confused about ext3 being converted to ext4. If the converted > > inodes have extra space, then ext4_iget() will start using the extra > > space when it modifies the on disk inode, won't it?i > > It is possible that you can have an ext3 file system with (for > example) 256 byte inodes, and all of the extra space was used for > extended attributes, then ext4 won't have the extra space available. > This is going toh be on an inode-by-inode basis, and if an extended > attribute is motdified or deleted, the space would become available,t > and then inode would start getting a higher resolution timestamp. Is it correct to assume that this kind of file would have to be created using the ext3.ko file system implementation that was removed in linux-4.3, but not using ext2.ko or ext4.ko (which would always set the extended timestamps even in "-t ext2" or "-t ext3" mode)? I tried to reproduce this on a modern kernel and with and moderately old debugfs (1.42.13) but failed. > I really don't think it's worth worrying about that, though. It's > highly unlikely ext3 file systems will be still be in service by the > time it's needed in 2038. And if so, it's highly unlikely they would > be converted to ext4. As the difference is easily visible even before y2038 by using utimensat(old_inode, future_date) on a file, we should at least decide what the sanest behavior is that we can easily implement, and then document what is expected to happen here. If we check for s_min_extra_isize instead of s_inode_size to determine s_time_gran/s_time_max, we would warn at mount time as well as and consistently truncate all timestamps to full 32-bit seconds, regardless of whether there is actually space or not. Alternatively, we could warn if s_min_extra_isize is too small, but use i_inode_size to determine s_time_gran/s_time_max anyway. From looking at e2fsprogs git history, I see that s_min_extra_isize has always been set by mkfs since 2008, but I'm not sure if there would have been a case in which it remains set but the ext3.ko would ignore it and use that space anyway. Arnd
On Fri, Aug 02, 2019 at 12:39:41PM +0200, Arnd Bergmann wrote: > Is it correct to assume that this kind of file would have to be > created using the ext3.ko file system implementation that was > removed in linux-4.3, but not usiing ext2.ko or ext4.ko (which > would always set the extended timestamps even in "-t ext2" or > "-t ext3" mode)? Correct. Some of the enterprise distro's were using ext4 to support "mount -t ext3" even before 4.3. There's a CONFIG option to enable using ext4 for ext2 or ext3 if they aren't enabled. > If we check for s_min_extra_isize instead of s_inode_size > to determine s_time_gran/s_time_max, we would warn > at mount time as well as and consistently truncate all > timestamps to full 32-bit seconds, regardless of whether > there is actually space or not. > > Alternatively, we could warn if s_min_extra_isize is > too small, but use i_inode_size to determine > s_time_gran/s_time_max anyway. Even with ext4, s_min_extra_isize doesn't guarantee that will be able to expand the inode. This can fail if (a) we aren't able to expand existing the transaction handle because there isn't enough space in the journal, or (b) there is already an external xattr block which is also full, so there is no space to evacuate an extended attribute out of the inode's extra space. We could be more aggressive by trying to expand make room in the inode in ext4_iget (when we're reading in the inode, assuming the file system isn't mounted read/only), instead of in the middle of mark_inode_dirty(). That will eliminate failure mode (a) --- which is statistically rare --- but it won't eliminate failure mode (b). Ultimately, the question is which is worse: having a timestamp be wrong, or randomly dropping an xattr from the inode to make room for the extended timestamp. We've come down on it being less harmful to have the timestamp be wrong. But again, this is a pretty rare case. I'm not convinced it's worth stressing about, since it's going to require multiple things to go wrong before a timestamp will be bad. - Ted
On Fri, Aug 2, 2019 at 5:43 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Fri, Aug 02, 2019 at 12:39:41PM +0200, Arnd Bergmann wrote: > > Is it correct to assume that this kind of file would have to be > > created using the ext3.ko file system implementation that was > > removed in linux-4.3, but not usiing ext2.ko or ext4.ko (which > > would always set the extended timestamps even in "-t ext2" or > > "-t ext3" mode)? > > Correct. Some of the enterprise distro's were using ext4 to support > "mount -t ext3" even before 4.3. There's a CONFIG option to enable > using ext4 for ext2 or ext3 if they aren't enabled. > > > If we check for s_min_extra_isize instead of s_inode_size > > to determine s_time_gran/s_time_max, we would warn > > at mount time as well as and consistently truncate all > > timestamps to full 32-bit seconds, regardless of whether > > there is actually space or not. > > > > Alternatively, we could warn if s_min_extra_isize is > > too small, but use i_inode_size to determine > > s_time_gran/s_time_max anyway. > > Even with ext4, s_min_extra_isize doesn't guarantee that will be able > to expand the inode. This can fail if (a) we aren't able to expand > existing the transaction handle because there isn't enough space in > the journal, or (b) there is already an external xattr block which is > also full, so there is no space to evacuate an extended attribute out > of the inode's extra space. I must have misunderstood what the field says. I expected that with s_min_extra_isize set beyond the nanosecond fields, there would be a guarantee that all inodes have at least as many extra bytes already allocated. What circumstances would lead to an i_extra_isize smaller than s_min_extra_isize? > We could be more aggressive by trying to expand make room in the inode > in ext4_iget (when we're reading in the inode, assuming the file > system isn't mounted read/only), instead of in the middle of > mark_inode_dirty(). That will eliminate failure mode (a) --- which is > statistically rare --- but it won't eliminate failure mode (b). > > Ultimately, the question is which is worse: having a timestamp be > wrong, or randomly dropping an xattr from the inode to make room for > the extended timestamp. We've come down on it being less harmful to > have the timestamp be wrong. > > But again, this is a pretty rare case. I'm not convinced it's worth > stressing about, since it's going to require multiple things to go > wrong before a timestamp will be bad. Agreed, I'm not overly worried about this happening frequently, I'd just feel better if we could reliably warn about the few instances that might theoretically be affected. Arnd
On Fri, Aug 02, 2019 at 09:00:52PM +0200, Arnd Bergmann wrote: > > I must have misunderstood what the field says. I expected that > with s_min_extra_isize set beyond the nanosecond fields, there > would be a guarantee that all inodes have at least as many > extra bytes already allocated. What circumstances would lead to > an i_extra_isize smaller than s_min_extra_isize? When allocating new inodes, i_extra_isize is set to s_want_extra_isize. When modifying existing inodes, if i_extra_isize is less than s_min_extra_isize, then we will attempt to move out extended attribute(s) to the external xattr block. So the s_min_extra_isize field is not a guarantee, but rather an aspirationa goal. The idea is that at some point when we want to enable a new feature, which needs more extra inode space, we can adjust s_min_extra_size and s_want_extra_size, and the file system will migrate things to meet these constraints. The plan was to teach e2fsck how to fix all of the inodes to meet theh s_min_extra_size value, but that never got implemented, and we even then, e2fsck would have to deal with the case where tit couldn't move the extended attribute(s) in the inode out, because there was no place to put them. In practice, this hasn't been that much of a limitation because we haven't been adding that many extra inode fields. Keep in mind that Red Hat for example, has explicitly said they will *never* support adding new features to an existing file system. Their only supported method is back up the file system, reformat it with the new file system features, and then restore the file system. Of course, if the backup/restore includes backing up the extended attributes, and then restoring them, the xattr restore could fail, unless the user also increased the inode size (e.g., from 256 bytes to 512 bytes). Getting this right in the general case is *hard*. Fortunately, the corner cases really don't happen that often in practice, at least not for pure Linux workloads. Windows which can have arbitrarily large security id's and ACL's might make this harder, of course --- although ext4's EA in inode feature would make this better, modulo needing to write more complex file system code to handle moving xattrs around. Since the extended timestamps were one of the first extra inode fields to be added, I strongly suggest that we not try to borrow trouble. Solving the general case problem is *hard*. - Ted
On Fri, Aug 2, 2019 at 11:39 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Fri, Aug 02, 2019 at 09:00:52PM +0200, Arnd Bergmann wrote: > > > > I must have misunderstood what the field says. I expected that > > with s_min_extra_isize set beyond the nanosecond fields, there > > would be a guarantee that all inodes have at least as many > > extra bytes already allocated. What circumstances would lead to > > an i_extra_isize smaller than s_min_extra_isize? > > When allocating new inodes, i_extra_isize is set to > s_want_extra_isize. When modifying existing inodes, if i_extra_isize > is less than s_min_extra_isize, then we will attempt to move out > extended attribute(s) to the external xattr block. So the > s_min_extra_isize field is not a guarantee, but rather an aspirationa > goal. The idea is that at some point when we want to enable a new > feature, which needs more extra inode space, we can adjust > s_min_extra_size and s_want_extra_size, and the file system will > migrate things to meet these constraints. I see in the ext4 code that we always try to expand i_extra_size to s_want_extra_isize in ext4_mark_inode_dirty(), and that s_want_extra_isize is always at least s_min_extra_isize, so we constantly try to expand the inode to fit. What I still don't see is how any inode on the file system image could have ended up with less than s_min_extra_isize in the first place if s_min_extra_isize is never modified and all inodes in the file system would have originally been created with i_extra_isize >= s_min_extra_isize if EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE is set. Did older versions of ext4 or ext3 ignore s_min_extra_isize when creating inodes despite EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE, or is there another possibility I'm missing? > Since the extended timestamps were one of the first extra inode fields > to be added, I strongly suggest that we not try to borrow trouble. > Solving the general case problem is *hard*. As I said before, I absolutely don't suggest we solve the problem of reliably setting the timestamps, I'm just trying to find out if there is a way to know for sure that it cannot happen and alert the user otherwise. So far I think we have concluded - just checking s_inode_size is not sufficient because ext3 may have created inodes with s_extra_isize too small - checking s_min_extra_isize may not be sufficient either, for similar reasons I don't yet fully understand (see above). If there is any other way to be sure that the file system has never been mounted as a writable ext3, maybe that can be used instead? Arnd
On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote: > > I see in the ext4 code that we always try to expand i_extra_size > to s_want_extra_isize in ext4_mark_inode_dirty(), and that > s_want_extra_isize is always at least s_min_extra_isize, so > we constantly try to expand the inode to fit. Yes, we *try*. But we may not succeed. There may actually be a problem here if the cause is due to there simply is no space in the external xattr block, so we might try and try every time we try to modify that inode, and it would be a performance mess. If it's due to there being no room in the current transaction, then it's highly likely it will succeed the next time. > Did older versions of ext4 or ext3 ignore s_min_extra_isize > when creating inodes despite > EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE, > or is there another possibility I'm missing? s_min_extra_isize could get changed in order to make room for some new file system feature --- such as extended timestamps. That's how we might take an old ext3 file system with an inode size > 128, and try to evacuate space for extended timestamps, on a best efforts basis. And since it's best efforts is why Red Hat refuses to support that case. It'll work 99.9% of the time, but they don't want to deal with the 0.01% cases showing up at their help desk. If you want to pretend that file systems never get upgraded, then life is much simpler. The general approach is that for less-sophisticated customers (e.g., most people running enterprise distros) file system upgrades are not a thing. But for sophisticated users, we do try to make thing work for people who are aware of the risks / caveats / rough edges. Google won't have been able to upgrade thousands and thousands of servers in data centers all over the world if we limited ourselves to Red Hat's support restrictions. Backup / reformat / restore really isn't a practical rollout strategy for many exabytes of file systems. It sounds like your safety checks / warnings are mostly targeted at low-information customers, no? - Ted
On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote: > > > > I see in the ext4 code that we always try to expand i_extra_size > > to s_want_extra_isize in ext4_mark_inode_dirty(), and that > > s_want_extra_isize is always at least s_min_extra_isize, so > > we constantly try to expand the inode to fit. > > Yes, we *try*. But we may not succeed. There may actually be a > problem here if the cause is due to there simply is no space in the > external xattr block, so we might try and try every time we try to > modify that inode, and it would be a performance mess. If it's due to > there being no room in the current transaction, then it's highly > likely it will succeed the next time. > > > Did older versions of ext4 or ext3 ignore s_min_extra_isize > > when creating inodes despite > > EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE, > > or is there another possibility I'm missing? > > s_min_extra_isize could get changed in order to make room for some new > file system feature --- such as extended timestamps. Ok, that explains it. I assumed s_min_extra_isize was meant to not be modifiable, and did not find a way to change it using the kernel or tune2fs, but now I can see that debugfs can set it. > If you want to pretend that file systems never get upgraded, then life > is much simpler. The general approach is that for less-sophisticated > customers (e.g., most people running enterprise distros) file system > upgrades are not a thing. But for sophisticated users, we do try to > make thing work for people who are aware of the risks / caveats / > rough edges. Google won't have been able to upgrade thousands and > thousands of servers in data centers all over the world if we limited > ourselves to Red Hat's support restrictions. Backup / reformat / > restore really isn't a practical rollout strategy for many exabytes of > file systems. > > It sounds like your safety checks / warnings are mostly targeted at > low-information customers, no? Yes, that seems like a reasonable compromise: just warn based on s_min_extra_isize, and assume that anyone who used debugfs to set s_min_extra_isize to a higher value from an ext3 file system during the migration to ext4 was aware of the risks already. That leaves the question of what we should set the s_time_gran and s_time_max to on a superblock with s_min_extra_isize<16 and s_want_extra_isize>=16. If we base it on s_min_extra_isize, we never try to set a timestamp later than 2038 and so will never fail, but anyone with a grandfathered s_min_extra_isize from ext3 won't be able to set extended timestamps on any files any more. Based on s_want_extra_isize we would keep the current behavior, but could add a custom warning in the ext4 code about the small s_min_extra_isize indicating a theoretical problem. Arnd
On Aug 3, 2019, at 2:24 PM, Arnd Bergmann <arnd@arndb.de> wrote: > > On Sat, Aug 3, 2019 at 6:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: >> >> On Sat, Aug 03, 2019 at 11:30:22AM +0200, Arnd Bergmann wrote: >>> >>> I see in the ext4 code that we always try to expand i_extra_size >>> to s_want_extra_isize in ext4_mark_inode_dirty(), and that >>> s_want_extra_isize is always at least s_min_extra_isize, so >>> we constantly try to expand the inode to fit. >> >> Yes, we *try*. But we may not succeed. There may actually be a >> problem here if the cause is due to there simply is no space in the >> external xattr block, so we might try and try every time we try to >> modify that inode, and it would be a performance mess. If it's due to >> there being no room in the current transaction, then it's highly >> likely it will succeed the next time. >> >>> Did older versions of ext4 or ext3 ignore s_min_extra_isize >>> when creating inodes despite >>> EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE, >>> or is there another possibility I'm missing? >> >> s_min_extra_isize could get changed in order to make room for some new >> file system feature --- such as extended timestamps. > > Ok, that explains it. I assumed s_min_extra_isize was meant to > not be modifiable, and did not find a way to change it using the > kernel or tune2fs, but now I can see that debugfs can set it. > >> If you want to pretend that file systems never get upgraded, then life >> is much simpler. The general approach is that for less-sophisticated >> customers (e.g., most people running enterprise distros) file system >> upgrades are not a thing. But for sophisticated users, we do try to >> make thing work for people who are aware of the risks / caveats / >> rough edges. Google won't have been able to upgrade thousands and >> thousands of servers in data centers all over the world if we limited >> ourselves to Red Hat's support restrictions. Backup / reformat / >> restore really isn't a practical rollout strategy for many exabytes of >> file systems. >> >> It sounds like your safety checks / warnings are mostly targeted at >> low-information customers, no? > > Yes, that seems like a reasonable compromise: just warn based > on s_min_extra_isize, and assume that anyone who used debugfs > to set s_min_extra_isize to a higher value from an ext3 file system > during the migration to ext4 was aware of the risks already. > > That leaves the question of what we should set the s_time_gran > and s_time_max to on a superblock with s_min_extra_isize<16 > and s_want_extra_isize>=16. > > If we base it on s_min_extra_isize, we never try to set a timestamp > later than 2038 and so will never fail, but anyone with a grandfathered > s_min_extra_isize from ext3 won't be able to set extended > timestamps on any files any more. Based on s_want_extra_isize > we would keep the current behavior, but could add a custom > warning in the ext4 code about the small s_min_extra_isize > indicating a theoretical problem. I think it makes the most sense to always try to set timestamps on inodes that have enough space for them. The chance of running into a filesystem with 256-byte inode size but *no* space in the inode to store an extended timestamp, but is *also* being modified by a new kernel after 2038 is vanishingly small. This would require formatting the filesystem with non-default mke2fs for ext3, using the filesystem and storing enough xattrs on inodes that there isn't space for 12 bytes of extra isize, and using it for 30+ years without upgrading to use ext4 (which will also try to expand the inode to store the nsec timestamps) and then modifying the inode after 2038. Rather than printing a warning at mount time (which may be confusing to users for a problem they may never see), it makes sense to only print such a warning in the vanishingly small case that someone actually tries to modify the inode timestamp but it doesn't fit, rather than on the theoretical case that may never happen. Cheers, Andreas
> Rather than printing a warning at mount time (which may be confusing > to users for a problem they may never see), it makes sense to only > print such a warning in the vanishingly small case that someone actually > tries to modify the inode timestamp but it doesn't fit, rather than on > the theoretical case that may never happen. Arnd and I were discussing and we came to a similar conclusion that we would not warn at mount. Arnd suggested maybe we include a pr_warn_ratelimited() when we do write to these special inodes. In general, there is an agreement to leave the fs granularity to a higher resolution for such super blocks. And hence, the timestamps for these special cases will never be clamped in memory.(Assuming we do not want to make more changes and try to do something like __ext4_expand_extra_isize() for in memory inode updates) We could easily try to clamp these timestamps when they get written out to the disk by modifying the EXT4_INODE_SET_XTIME to include such a clamp. And, at this point we could also warn. If this is acceptable, I could post an update. -Deepa
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 1cb67859e051..3f13cf12ae7f 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1631,6 +1631,10 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei) #define EXT4_GOOD_OLD_INODE_SIZE 128 +#define EXT4_EXTRA_TIMESTAMP_MAX (((s64)1 << 34) - 1 + S32_MIN) +#define EXT4_NON_EXTRA_TIMESTAMP_MAX S32_MAX +#define EXT4_TIMESTAMP_MIN S32_MIN + /* * Feature set definitions */ diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 4079605d437a..3ea2d60f33aa 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4035,8 +4035,21 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sbi->s_inode_size); goto failed_mount; } - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE) - sb->s_time_gran = 1 << (EXT4_EPOCH_BITS - 2); + /* + * i_atime_extra is the last extra field available for [acm]times in + * struct ext4_inode. Checking for that field should suffice to ensure + * we have extra space for all three. + */ + if (sbi->s_inode_size >= offsetof(struct ext4_inode, i_atime_extra) + + sizeof(((struct ext4_inode *)0)->i_atime_extra)) { + sb->s_time_gran = 1; + sb->s_time_max = EXT4_EXTRA_TIMESTAMP_MAX; + } else { + sb->s_time_gran = NSEC_PER_SEC; + sb->s_time_max = EXT4_NON_EXTRA_TIMESTAMP_MAX; + } + + sb->s_time_min = EXT4_TIMESTAMP_MIN; } sbi->s_desc_size = le16_to_cpu(es->s_desc_size);