Message ID | cphtxla2se4gavql3re5xju7mqxld4rp6q4wbqephb6by5ibfa@5myddcaxerpb (mailing list archive)
---|---
State | New
Series | [GIT,PULL] bcachefs fixes for 6.12-rc2
On Sat, 5 Oct 2024 at 11:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> Several more filesystems repaired, thank you to the users who have been providing testing. The snapshots + unlinked fixes on top of this are posted here:

I'm getting really fed up here Kent.

These have commit times from last night. Which makes me wonder how much testing they got.

And before you start whining - again - about how you are fixing bugs, let me remind you about the build failures you had on big-endian machines because your patches had gotten ZERO testing outside your tree.

That was just last week, and I'm getting the strong feeling that absolutely nothing was learnt from the experience.

I have pulled this, but I searched for a couple of the commit messages on the lists, and found *nothing* (ok, I found your pull request, which obviously mentioned the first line of the commit messages).

I'm seriously thinking about just stopping pulling from you, because I simply don't see you improving on your model. If you want to have an experimental tree, you can damn well have one outside the mainline kernel. I've told you before, and nothing seems to really make you understand.

I was hoping and expecting that bcachefs being mainlined would actually help development. It has not. You're still basically the only developer, there's no real sign that that will change, and you seem to feel like sending me untested stuff that nobody else has ever seen the day before the next rc release is just fine.

You're a smart person. I feel like I've given you enough hints. Why don't you sit back and think about it, and let's make it clear: you have exactly two choices here:

 (a) play better with others

 (b) take your toy and go home

Those are the choices.

             Linus
The pull request you sent on Sat, 5 Oct 2024 14:35:18 -0400:
> git://evilpiepirate.org/bcachefs.git tags/bcachefs-2024-10-05
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8f602276d3902642fdc3429b548d73c745446601
Thank you!
On Sat, Oct 05, 2024 at 03:34:56PM GMT, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 11:35, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > Several more filesystems repaired, thank you to the users who have been providing testing. The snapshots + unlinked fixes on top of this are posted here:
>
> I'm getting really fed up here Kent.
>
> These have commit times from last night. Which makes me wonder how much testing they got.

The /commit/ dates are from last night, because I polish up commit messages and reorder up until the last night (I always push smaller fixes up front and fixes that are likely to need rework to the back).

The vast majority of those fixes are all ~2 weeks old.

> And before you start whining - again - about how you are fixing bugs, let me remind you about the build failures you had on big-endian machines because your patches had gotten ZERO testing outside your tree.

No, there simply aren't that many people running big endian.

I have users building and running my trees on a daily basis. If I push something broken before I go to bed I have bug reports waiting for me _the next morning_ when I wake up.

> That was just last week, and I'm getting the strong feeling that absolutely nothing was learnt from the experience.
>
> I have pulled this, but I searched for a couple of the commit messages on the lists, and found *nothing* (ok, I found your pull request, which obviously mentioned the first line of the commit messages).
>
> I'm seriously thinking about just stopping pulling from you, because I simply don't see you improving on your model. If you want to have an experimental tree, you can damn well have one outside the mainline kernel. I've told you before, and nothing seems to really make you understand.

At this point, it's honestly debatable whether the experimental label should apply. I'm getting bug reports that talk about production use, and working on metadata dumps where the superblock indicates the filesystem has been in continuous use for years.

And many, many people talking about how even at this relatively early point it doesn't fall over like btrfs does.

Let that sink in. Btrfs has been mainline for years, and it still craps out on people. I was just in a meeting two days ago, closing funding, and a big reason it was an easy sell was because they have to run btrfs in _read only_ mode because otherwise it craps out.

So if the existing process, the existing way of doing things, hasn't been able to get btrfs to a point where people can rely on it after 10 years - perhaps you and the community don't know quite as much as you think you do about the realities of what it takes to ship a working filesystem.

And from where I sit, on the bcachefs side of things, things are going smoothly and quickly. Bug reports are diminishing in frequency and severity, even as userbase is going up; distros are picking it up (just not Debian and Fedora); the timeline I laid out at LSF is still looking reasonable.

> I was hoping and expecting that bcachefs being mainlined would actually help development. It has not. You're still basically the only developer, there's no real sign that that will change, and you seem to feel like sending me untested stuff that nobody else has ever seen the day before the next rc release is just fine.

I've got a team lined up; I just secured funding to start paying them, and it looks like I'm about to secure more.

And the community is growing: I'm reviewing and taking patches from more people, and regularly mentoring them on the codebase.

And on top of all that, you shouting about "process" rings pretty hollow when I _remember_ the days when you guys were rewriting core mm code in rc kernels.

Given where bcachefs is at in the lifecycle of a big codebase being stabilized, you should be expecting to see stuff like that here. Stuff is getting found and fixed, and then we ship those fixes so we can find the next stuff.

> You're a smart person. I feel like I've given you enough hints. Why don't you sit back and think about it, and let's make it clear: you have exactly two choices here:
>
> (a) play better with others
>
> (b) take your toy and go home

You've certainly yelled a lot...
On Sat, 5 Oct 2024 at 15:54, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> The vast majority of those fixes are all ~2 weeks old.

With the patches not appearing on the list, that seems entirely irrelevant.

Apparently they are 2 weeks old IN YOUR TREE.

And absolutely nowhere else.

> Let that sink in.

Seriously.

You completely dodged my actual argument, except for pointing at how we didn't have process two decades ago.

If you can't actually even face this, what's the point any more?

             Linus
On Sat, Oct 05, 2024 at 04:15:25PM GMT, Linus Torvalds wrote: > On Sat, 5 Oct 2024 at 15:54, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > The vast majority of those fixes are all ~2 weeks old. > > With the patches not appearing on the list, that seems entirely irrelevant. > > Apparently they are 2 weeks on IN YOUR TREE. > > And absolutely nowhere else. If what you want is patches appearing on the list, I'm not unwilling to make that change. I take issue, and indeed even dig my heels in, when the only people asking for that are _only_ yelling about that and aren't involved otherwise. But you will find that if you talk to me as one human being to another, where we can share and listen to each other's concerns, I'm more than happy to be reasonable. But I'm not going to just toe the line when it's just yelling. Seriously. Because the last time you flipped out over a pull request, I spent the rest of the cycle telling people "x y and z are fixed, but you'll have to build my tree instead of running a released kernel". And that gets tiresome; some of the bugs were significant - and no issues to date have been found in the stuff you kicked back, which tells me my process is just fine. So let _that_ sink in. In order to support my userbase, as well as iterate to find the _next_ set of bugs, I have to be able to ship bugfixes in a timely manner, and if that's going to keep being an issue perhaps I should be having those conversations with distro kernel maintainers now, instead of later. > > Let that sink in. > > Seriously. > > You completely dodged my actual argument, except for pointing at how > we didn't have process two decades ago. > > If you can't actually even face this, what's the point any more? Face _what_ exactly? Because at this point, I can't even tell what it is you want, what you're reacting to keeps shifting.
On Sat, Oct 05, 2024 at 07:41:03PM GMT, Kent Overstreet wrote: > Face _what_ exactly? Because at this point, I can't even tell what it is > you want, what you're reacting to keeps shifting. And more than that, I'm done with trying to cater, and I'm done with these long winded rants. Look, I quite enjoy the direct approach, but I'm done with having to apologize for you in order to calm people down every time this happens. If you're so convinced you know best, I invite you to start writing your own filesystem. Go for it.
On Sat, 5 Oct 2024 at 16:41, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > If what you want is patches appearing on the list, I'm not unwilling to > make that change. I want you to WORK WITH OTHERS. Including me - which means working with the rules and processes we have in place. Making the argument that we didn't have those rules twenty years ago is just stupid. We have them NOW, because we learnt better. You don't get to say "look, you didn't have rules 20 years ago, so why should I have them now?" Patches appearing on the list is not some kind of sufficient thing. It's the absolute minimal requirement. The fact that absolutely *NONE* of the patches in your pull request showed up when I searched just means that you clearly didn't even attempt to have others involved (ok, I probably only searched for half of them and then I gave up in disgust). We literally had a bcachefs build failure last week. It showed up pretty much immediately after I pulled your tree. And because you sent in the bcachefs "fixes" with the bug the day before I cut rc1, we ended up with a broken rc1. And hey, mistakes happen. But when the *SAME* absolute disregard for testing happens the very next weekend, do you really expect me to be happy about it? It's this complete disregard for anybody else that I find problematic. You don't even try to get other developers involved, or follow upstream rules. And then you don't seem to even understand why I then complain. In fact, you in the next email say: > If you're so convinced you know best, I invite you to start writing your > own filesystem. Go for it. Not at all. I'm not interested in creating another bcachefs. I'm contemplating just removing bcachefs entirely from the mainline tree. Because you show again and again that you have no interest in trying to make mainline work. You can do it out of mainline. You did it for a decade, and that didn't cause problems. I thought it would be better if it finally got mainlined, but by all your actions you seem to really want to just play in your own sandbox and not involve anybody else. So if this is just your project and nobody else is expected to participate, and you don't care about the fact that you break the mainline build, why the hell did you want to be in the mainline tree in the first place? Linus
On Sat, Oct 05, 2024 at 05:14:31PM GMT, Linus Torvalds wrote:
> On Sat, 5 Oct 2024 at 16:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > If what you want is patches appearing on the list, I'm not unwilling to make that change.
>
> I want you to WORK WITH OTHERS. Including me - which means working with the rules and processes we have in place.

That has to work both ways.

Because when I explain my reasoning and processes, and it's ignored and the same basic stuff is repeatedly yelled back, I'm just going to tune it out.

I'm more than happy to work with people, but that's got to be a conversation, and one based on mutual respect.

> Making the argument that we didn't have those rules twenty years ago is just stupid. We have them NOW, because we learnt better. You don't get to say "look, you didn't have rules 20 years ago, so why should I have them now?"

That wasn't my argument. My point was that a codebase at an earlier phase of development, one that hasn't had as long to stabilize, is inherently going to be more in flux. Earlier in development, fixing bugs is going to be a higher priority, relatively speaking, vs. avoiding regressions; sometimes the important thing is to make forward progress, iterate, ship, and get feedback from users.

I think the way you guys were doing development 20 years ago was entirely appropriate at that time, and that's what I need to be doing now; I need to be less conservative than the kernel as a whole.

That isn't to say that there aren't things we can and should be doing to mitigate that (i.e. improving build testing, which I can do now that I'm finishing up with the current project), or that there isn't room for discussion on the particulars.

But seriously: bcachefs is shaping up far better than btrfs (which, afaik, _did_ try to play by all the rules), and process has _absolutely_ been a factor in that.

> Patches appearing on the list is not some kind of sufficient thing. It's the absolute minimal requirement. The fact that absolutely *NONE* of the patches in your pull request showed up when I searched just means that you clearly didn't even attempt to have others involved (ok, I probably only searched for half of them and then I gave up in disgust).

Those fixes were all pretty basic, and broadly speaking I know what everyone else who's working on bcachefs is doing and what they're working on. Hongbo has been quite helpful with a bunch of things (and is starting to help out in the bug tracker and IRC channel); Alan has been digging around in six locks and most recently the cycle detector code, and I've been answering questions as he learns his way around; Thomas has been getting started on some backpointers scalability work.

Nothing will be served by having them thoroughly review a big stream of small, uninteresting fixes; it'd suck up all their time and prevent them from doing anything productive. I have, quite literally, tried this and had it happen on multiple occasions in the past.

I do post when I've got something more interesting going on, and I'd been anticipating posting more as the stabilizing slows down.

> We literally had a bcachefs build failure last week. It showed up pretty much immediately after I pulled your tree. And because you sent in the bcachefs "fixes" with the bug the day before I cut rc1, we ended up with a broken rc1.
>
> And hey, mistakes happen. But when the *SAME* absolute disregard for testing happens the very next weekend, do you really expect me to be happy about it?

And I do apologize for the build failure, and I will get on the automated multi-arch build testing - that needed to happen anyways.

But I also have to remind you that I'm one of the few people who's actually been pushing for more and better automated testing (I now have infrastructure for the community that anyone can use - just ask me for an account) - and that's been another solo effort because so few people are even interested, so the fact that this even came up grates on me. This is a problem with a technical solution, and instead we're all just arguing.

> It's this complete disregard for anybody else that I find problematic. You don't even try to get other developers involved, or follow upstream rules.

Linus, just because you don't see it doesn't mean it doesn't exist. I spend a significant fraction of my day on IRC and the phone with both users and other developers.

And "upstream rules" has always been a fairly ad-hoc thing, which even you barely seem able to spell out. It's taken _forever_ to get to "yes, you do want patches on the list", and you seem to have some feeling that the volume of fixes is an issue for you, but god only knows if that's more than a hazy feeling for you.

> > If you're so convinced you know best, I invite you to start writing your own filesystem. Go for it.
>
> Not at all. I'm not interested in creating another bcachefs.
>
> I'm contemplating just removing bcachefs entirely from the mainline tree. Because you show again and again that you have no interest in trying to make mainline work.

You can do that, and it won't be the end of the world for me (although a definite inconvenience) - but it's going to suck for a lot of users.

> You can do it out of mainline. You did it for a decade, and that didn't cause problems. I thought it would be better if it finally got mainlined, but by all your actions you seem to really want to just play in your own sandbox and not involve anybody else.
>
> So if this is just your project and nobody else is expected to participate, and you don't care about the fact that you break the mainline build, why the hell did you want to be in the mainline tree in the first place?

Honestly? Because I want Linux to have a filesystem we can all be proud of, that users can rely on, that has a level of robustness and polish that we can all aspire to.
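For readers following along: the "automated multi-arch build testing" referred to above can be approximated with nothing more than distro cross-compilers. The sketch below is illustrative only - the toolchain prefixes and the choice of configs are assumptions based on Debian's cross-gcc packages, not a description of bcachefs's actual CI - but it shows the kind of big-endian build smoke test that would catch breakage like the one discussed in this thread before a pull request goes out.

```sh
#!/bin/sh
# Illustrative multi-arch build smoke test - NOT the actual bcachefs CI.
# Assumes Debian-style cross toolchains (gcc-powerpc64-linux-gnu,
# gcc-s390x-linux-gnu); adjust the prefixes for your distro.
set -e

build_one() {
	arch=$1 cross=$2
	mkdir -p "build-$arch"
	# allmodconfig enables CONFIG_BCACHEFS_FS=m, so the directory build
	# below actually compiles the filesystem code.
	make ARCH="$arch" CROSS_COMPILE="$cross" O="build-$arch" allmodconfig
	# Building only fs/bcachefs/ keeps the turnaround short while still
	# catching endianness and word-size breakage at compile time.
	make ARCH="$arch" CROSS_COMPILE="$cross" O="build-$arch" -j"$(nproc)" fs/bcachefs/
}

build_one powerpc powerpc64-linux-gnu-   # 64-bit big endian
build_one s390    s390x-linux-gnu-       # big endian
```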
Here is a user's perspective from someone who's built a career from Linux (thanks to all of you)... The big hardship with testing bcachefs before it was merged into the kernel was that it couldn't be built as an out-of-tree module and instead a whole other kernel tree needed to be built. That was a pain. Now, the core kernel infrastructure changes that bcachefs relies on are in the kernel and bcachefs can very easily and quickly be built as an out-of-tree module in just a few seconds. I submit to all involved that maybe that's the best way to go **for now**. Switching to out of tree for now would make it much easier for Kent to have the fast-paced development model he desires for this stage in bcachefs' development. It would also make using and testing bcachefs much easier for power users like me because when an issue is detected we could get a fix or new feature much faster than having to wait for a distribution to ship the next kernel version and with less ancillary risk than building and using a less-tested kernel tree. Distributions themselves also are very familiar with packaging up out-of-tree modules and distribution tools like dkms make using them dead simple even for casual users. The way things are now isn't great for me as a Linux power user. I often want to use the latest or even RC kernels on my systems to get some new hardware support or other feature and I'm used to being able to do that without too many problems. But recently I've had to skip cutting-edge kernel versions that I otherwise wanted to try because there have been issues in bcachefs that I didn't want to have to face or work around. Switching to an out of tree module for now would be the best of all worlds for me because I could pick and choose which combination of kernel / bcachefs to use for each system and situation. Just my 2¢. Carl > On 2024-10-05 5:14 PM PDT Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Sat, 5 Oct 2024 at 16:41, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > If what you want is patches appearing on the list, I'm not unwilling to > > make that change. > > I want you to WORK WITH OTHERS. Including me - which means working > with the rules and processes we have in place. > > Making the argument that we didn't have those rules twenty years ago > is just stupid. We have them NOW, because we learnt better. You don't > get to say "look, you didn't have rules 20 years ago, so why should I > have them now?" > > Patches appearing on the list is not some kind of sufficient thing. > It's the absolute minimal requirement. The fact that absolutely *NONE* > of the patches in your pull request showed up when I searched just > means that you clearly didn't even attempt to have others involved > (ok, I probably only searched for half of them and then I gave up in > disgust). > > We literally had a bcachefs build failure last week. It showed up > pretty much immediately after I pulled your tree. And because you sent > in the bcachefs "fixes" with the bug the day before I cut rc1, we > ended up with a broken rc1. > > And hey, mistakes happen. But when the *SAME* absolute disregard for > testing happens the very next weekend, do you really expect me to be > happy about it? > > It's this complete disregard for anybody else that I find problematic. > You don't even try to get other developers involved, or follow > upstream rules. > > And then you don't seem to even understand why I then complain. 
> > In fact, you in the next email say: > > > If you're so convinced you know best, I invite you to start writing your > > own filesystem. Go for it. > > Not at all. I'm not interested in creating another bcachefs. > > I'm contemplating just removing bcachefs entirely from the mainline > tree. Because you show again and again that you have no interest in > trying to make mainline work. > > You can do it out of mainline. You did it for a decade, and that > didn't cause problems. I thought it would be better if it finally got > mainlined, but by all your actions you seem to really want to just > play in your own sandbox and not involve anybody else. > > So if this is just your project and nobody else is expected to > participate, and you don't care about the fact that you break the > mainline build, why the hell did you want to be in the mainline tree > in the first place? > > Linus
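For context on the workflow Carl describes: building a filesystem as an external module goes through the standard kbuild external-module interface, and dkms can automate the rebuild whenever a new kernel is installed. The commands below are a generic sketch only - they assume a source checkout at ~/src/bcachefs with a kbuild Makefile, and the dkms package name and version are placeholders; the real out-of-tree bcachefs packaging may differ.

```sh
# Generic external-module build against the running kernel's headers
# (assumes a source checkout at ~/src/bcachefs with a kbuild Makefile;
# the real out-of-tree bcachefs packaging may differ).
cd ~/src/bcachefs
make -C /lib/modules/"$(uname -r)"/build M="$PWD" modules
sudo make -C /lib/modules/"$(uname -r)"/build M="$PWD" modules_install
sudo depmod -a
sudo modprobe bcachefs

# Or register the source with dkms so it is rebuilt automatically for each
# new kernel. Package name and version are placeholders; depending on the
# Makefile you may also need MAKE[0]/CLEAN[0] entries in dkms.conf.
sudo cp -r ~/src/bcachefs /usr/src/bcachefs-snapshot
cat <<'EOF' | sudo tee /usr/src/bcachefs-snapshot/dkms.conf
PACKAGE_NAME="bcachefs"
PACKAGE_VERSION="snapshot"
BUILT_MODULE_NAME[0]="bcachefs"
DEST_MODULE_LOCATION[0]="/kernel/fs/bcachefs"
AUTOINSTALL="yes"
EOF
sudo dkms add -m bcachefs -v snapshot
sudo dkms install -m bcachefs -v snapshot
```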
On Sat, Oct 05, 2024 at 06:20:53PM GMT, Carl E. Thompson wrote: > Here is a user's perspective from someone who's built a career from Linux (thanks to all of you)... > > The big hardship with testing bcachefs before it was merged into the kernel was that it couldn't be built as an out-of-tree module and instead a whole other kernel tree needed to be built. That was a pain. > > Now, the core kernel infrastructure changes that bcachefs relies on are in the kernel and bcachefs can very easily and quickly be built as an out-of-tree module in just a few seconds. I submit to all involved that maybe that's the best way to go **for now**. > > Switching to out of tree for now would make it much easier for Kent to have the fast-paced development model he desires for this stage in bcachefs' development. It would also make using and testing bcachefs much easier for power users like me because when an issue is detected we could get a fix or new feature much faster than having to wait for a distribution to ship the next kernel version and with less ancillary risk than building and using a less-tested kernel tree. Distributions themselves also are very familiar with packaging up out-of-tree modules and distribution tools like dkms make using them dead simple even for casual users. > > The way things are now isn't great for me as a Linux power user. I > often want to use the latest or even RC kernels on my systems to get > some new hardware support or other feature and I'm used to being able > to do that without too many problems. But recently I've had to skip > cutting-edge kernel versions that I otherwise wanted to try because > there have been issues in bcachefs that I didn't want to have to face > or work around. Switching to an out of tree module for now would be > the best of all worlds for me because I could pick and choose which > combination of kernel / bcachefs to use for each system and situation. Carl - thanks, I wasn't aware of this. Can you give me details? 6.11 had the disk accounting rewrite, which was huge and (necessarily) had some fallout, if you're seeing regressions otherwise that are slipping through then - yes it's time to slow down and reevaluate. Details would be extremely helpful, so we can improve our regression testing.
Yeah, of course there were the disk accounting issues, and before that was the kernel upgrade-downgrade bug going from 6.8 back to 6.7. Currently over on Reddit at least one user is mentioning read errors and/or performance regressions on the current RC version that I'd rather avoid.

There were a number of other issues that cropped up in some earlier versions but not others, such as deadlocks when using compression (particularly zstd), weirdness when using compression with 4k blocks, and suspend/resume failures when using bcachefs.

None of those things were a big deal to me as I mostly only use bcachefs on root filesystems, which are of course easy to recreate. But I do currently use bcachefs for all the filesystems on my main laptop, so issues there can be more of a pain.

As an example of potential issues I'd like to avoid: I often upgrade my laptop and swap the old SSD in, and am currently considering pulling the trigger on a Ryzen AI laptop such as the ProArt P16. However, this new processor has some cutting-edge features only fully supported in 6.12, so I'd prefer to use that kernel if I can. But... because according to Reddit there are apparently issues with bcachefs in the 6.12 RC kernels, I am hesitant to buy the laptop and use the RC kernel in the carefree manner I normally would. Yeah, first world problems!

Speaking of Reddit, I don't know if you saw it but a user there quotes you as saying users who use release candidates should expect them to be "dangerous as crap." I could not find a post where you said that in the thread that user pointed to, but if you **did** say something like that then I guess I have a different concept of what "release candidate" means.

So for me it would be a lot easier if bcachefs versions were decoupled from kernel versions.

Thanks,
Carl

> On 2024-10-05 6:56 PM PDT Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Sat, Oct 05, 2024 at 06:20:53PM GMT, Carl E. Thompson wrote:
> > Here is a user's perspective from someone who's built a career from Linux (thanks to all of you)...
> >
> > The big hardship with testing bcachefs before it was merged into the kernel was that it couldn't be built as an out-of-tree module and instead a whole other kernel tree needed to be built. That was a pain.
> >
> > Now, the core kernel infrastructure changes that bcachefs relies on are in the kernel and bcachefs can very easily and quickly be built as an out-of-tree module in just a few seconds. I submit to all involved that maybe that's the best way to go **for now**.
> >
> > Switching to out of tree for now would make it much easier for Kent to have the fast-paced development model he desires for this stage in bcachefs' development. It would also make using and testing bcachefs much easier for power users like me because when an issue is detected we could get a fix or new feature much faster than having to wait for a distribution to ship the next kernel version, and with less ancillary risk than building and using a less-tested kernel tree. Distributions themselves also are very familiar with packaging up out-of-tree modules, and distribution tools like dkms make using them dead simple even for casual users.
> >
> > The way things are now isn't great for me as a Linux power user. I often want to use the latest or even RC kernels on my systems to get some new hardware support or other feature, and I'm used to being able to do that without too many problems. But recently I've had to skip cutting-edge kernel versions that I otherwise wanted to try because there have been issues in bcachefs that I didn't want to have to face or work around. Switching to an out of tree module for now would be the best of all worlds for me because I could pick and choose which combination of kernel / bcachefs to use for each system and situation.
>
> Carl - thanks, I wasn't aware of this.
>
> Can you give me details? 6.11 had the disk accounting rewrite, which was huge and (necessarily) had some fallout; if you're seeing regressions otherwise that are slipping through then - yes, it's time to slow down and reevaluate.
>
> Details would be extremely helpful, so we can improve our regression testing.
On Sat, Oct 05, 2024 at 08:06:31PM GMT, Carl E. Thompson wrote:
> Yeah, of course there were the disk accounting issues, and before that was the kernel upgrade-downgrade bug going from 6.8 back to 6.7. Currently over on Reddit at least one user is mentioning read errors and/or performance regressions on the current RC version that I'd rather avoid.

So, disk accounting rewrite: that code was basically complete, just baking, for a full six months before merging - so, not exactly rushed, and it saw user testing before merging. Given the size, and how invasive it was, some regressions were inevitable, and they were pretty small and localized.

The upgrade/downgrade bug was really nasty, yeah.

> There were a number of other issues that cropped up in some earlier versions but not others, such as deadlocks when using compression (particularly zstd), weirdness when using compression with 4k blocks, and suspend/resume failures when using bcachefs.

I don't believe any of those were bcachefs regressions, although some are bcachefs bugs - suspend/resume, for example, there's still an open bug. I've seen multiple compression bugs that were mostly not bcachefs bugs (i.e. there was a zstd bug that affected bcachefs that took forever to fix, and there's a recently reported LZ4HC bug that may or may not be bcachefs).

> None of those things were a big deal to me as I mostly only use bcachefs on root filesystems, which are of course easy to recreate. But I do currently use bcachefs for all the filesystems on my main laptop, so issues there can be more of a pain.

Are you talking about issues you've hit, or issues that you've seen reported? Because the main subject of discussion is regressions.

> As an example of potential issues I'd like to avoid: I often upgrade my laptop and swap the old SSD in, and am currently considering pulling the trigger on a Ryzen AI laptop such as the ProArt P16. However, this new processor has some cutting-edge features only fully supported in 6.12, so I'd prefer to use that kernel if I can. But... because according to Reddit there are apparently issues with bcachefs in the 6.12 RC kernels, I am hesitant to buy the laptop and use the RC kernel in the carefree manner I normally would. Yeah, first world problems!

The main 6.12-rc1 issue was actually caused by Christian's change to inode state wakeups - it was a VFS change where bcachefs wasn't updated. That should've been caught by automated testing on fs-next - so that one's on me; fs-next is still fairly new and I still need to get that going.

> Speaking of Reddit, I don't know if you saw it but a user there quotes you as saying users who use release candidates should expect them to be "dangerous as crap." I could not find a post where you said that in the thread that user pointed to, but if you **did** say something like that then I guess I have a different concept of what "release candidate" means.

I don't recall saying that, but I did say something about Canonical shipping rc kernels to the general population - that's a bit crazy. Rc kernels should generally be run by users who know what they're getting into and have some ability to help test and debug.

> So for me it would be a lot easier if bcachefs versions were decoupled from kernel versions.

Well, this sounds more like generalized concern than anything concrete I can act on, to be honest - but if you've got regressions that you've been hit by, please tell me about those.
The feedback I've generally been getting has been that each release has been getting steadily better, and more stable and usable - and lately pretty much all I've been doing has been fixing user reported bugs, so those I naturally want to get out quickly if the bugs are serious enough and I'm confident that they'll be low risk - and there has been a lot of that. The shrinker fixes for fsck OOMing that didn't land in 6.11 were particularly painful for a lot of users. The key cache/rcu pending work that didn't land in 6.11, that was a major usability issue for several users that I talked to. The past couple weeks I've been working on filesystem repair and snapshots issues for several users that were inadvertently torture testing snapshots - the fixes are turning out to be fairly involved, but I'm also weighing there "how likely are other users to be affected by this, and do we want to wait another 3 months", and I've got multiple reports of affected users.
On Sat, Oct 05, 2024 at 08:54:32PM -0400, Kent Overstreet wrote:
> But I also have to remind you that I'm one of the few people who's actually been pushing for more and better automated testing (I now have infrastructure for the community that anyone can use - just ask me for an account) - and that's been another solo effort because so few people are even interested, so the fact that this even came up grates on me. This is a problem with a technical solution, and instead we're all just arguing.

Um, hello? All of the file system developers have our own automated testing, and my system, {kvm,gce,android}-xfstests[1][2], and Luis's kdevops[3] are both available for others to use. We've done quite a lot in terms of documentation and making it easier for others to use. (And that's not including the personal test runners used by folks like Josef, Christoph, Dave, and Darrick.)

[1] https://thunk.org/gce-xfstest
[2] https://github.com/tytso/xfstests-bld
[3] https://github.com/linux-kdevops/kdevops

That's why we're not particularly interested in yours --- my system has been in active use since 2011, and it's been well-tuned for me and others to use. (For example, Leah has been using it for XFS stable backports, and it's also used for testing Google's Data Center kernels, and GCE's Cloud Optimized OS.)

You may believe that yours is better than anyone else's, but with respect, I disagree, at least for my own workflow and use case. And if you look at the number of contributors in both Luis's and my xfstests runners[2][3], I suspect you'll find that we have far more contributors in our git repos than your solo effort....

             - Ted
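For readers who haven't used the tools Ted lists: xfstests-bld wraps xfstests in ready-made test appliances, and a typical run looks roughly like the lines below. The config names here are the documented ext4 examples; per-filesystem setup and the full option set are covered in the xfstests-bld documentation, so treat this as a sketch rather than a complete recipe.

```sh
# Local smoke run of xfstests via xfstests-bld's kvm-xfstests wrapper
# (see [2] above); gce-xfstests accepts the same arguments but runs the
# tests on Google Compute Engine instead of a local VM.
kvm-xfstests smoke                # short sanity subset, default config
kvm-xfstests -c ext4/4k -g auto   # full 'auto' group against the ext4 4k config
```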
On Sun, Oct 06, 2024 at 12:30:02AM GMT, Theodore Ts'o wrote: > On Sat, Oct 05, 2024 at 08:54:32PM -0400, Kent Overstreet wrote: > > But I also have to remind you that I'm one of the few people who's > > actually been pushing for more and better automated testing (I now have > > infrastructure for the communty that anyone can use, just ask me for an > > account) - and that's been another solo effort because so few people are > > even interested, so the fact that this even came up grates on me. This > > is a problem with a technical solution, and instead we're all just > > arguing. > > Um, hello? All of the file system developers have our own automated > testing, and my system, {kvm,gce,android}-xfstests[1][[2] and Luis's > kdevops[3] are both availble for others to use. We've done quite a > lot in terms of doumentations and making it easier for others to use. > (And that's not incluing the personal test runners used by folks like > Josef, Cristoph, Dave, and Darrick.) > > [1] https://thunk.org/gce-xfstest > [2] https://github.com/tytso/xfstests-bld > [3] https://github.com/linux-kdevops/kdevops > > That's why we're not particularly interested in yours --- my system > has been in active use since 2011, and it's been well-tuned for me and > others to use. (For example, Leah has been using it for XFS stable > backports, and it's also used for testing Google's Data Center > kernels, and GCE's Cloud Optimized OS.) > > You may believe that yours is better than anyone else's, but with > respect, I disagree, at least for my own workflow and use case. And > if you look at the number of contributors in both Luis and my xfstests > runners[2][3], I suspect you'll find that we have far more > contributors in our git repo than your solo effort.... Correct me if I'm wrong, but your system isn't available to the community, and I haven't seen a CI or dashboard for kdevops? Believe me, I would love to not be sinking time into this as well, but we need to standardize on something everyone can use.
Hi Kent, hi Linus.

Kent Overstreet - 06.10.24, 02:54:32 CEST:
> On Sat, Oct 05, 2024 at 05:14:31PM GMT, Linus Torvalds wrote:
> > On Sat, 5 Oct 2024 at 16:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > If what you want is patches appearing on the list, I'm not unwilling to make that change.
> >
> > I want you to WORK WITH OTHERS. Including me - which means working with the rules and processes we have in place.
>
> That has to work both ways.

Exactly, Kent.

And it is my impression from reading the whole thread up to now, and from reading previous threads, that it is actually about: having your way and your way only.

That is not exactly "work both ways".

Quite similarly regarding your stance towards distributions like Debian.

Sure, you can question well-established rules all you want, and maybe you are even right about it. I do not feel qualified enough to judge that. I am all for challenging well-established rules on justified grounds…

But… even if that is the case, it is still a negotiation process. Expecting that communities change well-established rules on the spot just because you are asking for it… quite bold if you ask me.

It would be a negotiation process, and "work both ways" would mean agreeing on some kind of middle ground. But it appears to me you do not have the patience for such a process. So it is arguing on both sides, which costs a lot of energy for everyone involved.

From what I perceive, you are actually actively working against well-established rules. And you are surprised by the reaction? That is kind of naive if you ask me.

At least you wrote you are willing to post patches to the mailing list: so why not start with at least that *minimal* requirement, according to Linus, as a first step? Maybe even just as a sign of good will towards the kernel community? That has been asked of you concretely, so why not just do it?

Maybe this can work out by negotiating a middle ground, going one little step at a time?

I still do have a BCacheFS on my laptop for testing, but meanwhile I wonder whether some of the crazy kernel regressions I have seen with the last few kernels were exactly related to having mounted that BCacheFS test filesystem. I am tempted to replace the BCacheFS with a BTRFS just to find out.

Lastly, the 6.10.12-1 Debian kernel crashes on a pool-spawner thread when I enter the command "reboot". That is right, a reboot crashes the system – I have never seen anything this crazy with any Linux kernel so far! I have taken a photo of it, but after that long series of regressions I am even too tired to post a bug report about it just to be told again to bisect the issue. And it is not the first workqueue-related issue I have found between 6.8 and 6.11 kernels.

Actually I think I will just replace that BCacheFS with another BTRFS in order to see whether it reduces the amount of crazy regressions I got so fed up with recently. Especially as it's not fair to report all of this to the Lenovo Linux community guy Mark Pearson in case it's not even related to the new ThinkPad T14 AMD Gen 5 I am using. Mind you, that series of regressions started with a T14 AMD Gen 1, roughly at the time I started testing BCacheFS, and I had hoped they would go away with the new laptop.

Additionally, I have not seen a single failure with BTRFS on any of my systems – including quite some laptops and several servers, even using LXC containers – for… I don't remember when. Since kernel 4.6, BTRFS has been rock stable, at least for me. And I agree, it took a huge lot of time until it was stable.

But whether that is due to the processes you criticize, or other reasons, or a combination thereof… do you know for sure?

I am wondering: did the mainline kernel just get so much more unstable in the last 3-6 months, or may there be a relationship to the test BCacheFS filesystem I was using that has eluded me so far? Of course, I do not know for now, but reading Carl's mails really made me wonder. Maybe there is none, so don't get me wrong… but reading this thread got me suspicious now.

I am happy to be proven wrong on that suspicion, and I commit to reporting back on it – especially if the amount of regressions does not decline and I turn out to have suspected BCacheFS unjustly.

Best,
On Sun, Oct 06, 2024 at 01:49:23PM GMT, Martin Steigerwald wrote:
> Hi Kent, hi Linus.
>
> Kent Overstreet - 06.10.24, 02:54:32 CEST:
> > On Sat, Oct 05, 2024 at 05:14:31PM GMT, Linus Torvalds wrote:
> > > On Sat, 5 Oct 2024 at 16:41, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > > > If what you want is patches appearing on the list, I'm not unwilling to make that change.
> > >
> > > I want you to WORK WITH OTHERS. Including me - which means working with the rules and processes we have in place.
> >
> > That has to work both ways.
>
> Exactly, Kent.
>
> And it is my impression from reading the whole thread up to now, and from reading previous threads, that it is actually about: having your way and your way only.
>
> That is not exactly "work both ways".
>
> Quite similarly regarding your stance towards distributions like Debian.

My issue wasn't with Debian as a whole; it was with one particular packaging rule which was causing issues, and a maintainer who - despite warnings that it would cause issues - broke the build and sat on it, leaving a broken version up, which resulted in users unable to access their filesystems when they couldn't mount in degraded mode.

> I still do have a BCacheFS on my laptop for testing, but meanwhile I wonder whether some of the crazy kernel regressions I have seen with the last few kernels were exactly related to having mounted that BCacheFS test filesystem. I am tempted to replace the BCacheFS with a BTRFS just to find out.

I think you should be looking elsewhere - there have been zero reports of random crashes or anything like what you're describing. Even in syzbot testing we've been pretty free from the kind of memory safety issues that would cause random crashes.

The closest bugs to what you're describing would be the __wait_on_freeing_inode() deadlock in 6.12-rc1, and the LZ4HC crash that I've yet to triage - but you specifically have to be using lz4:15 compression to hit that path.

The worst syzbot has come up with is something strange at the boundary with the crypto code, and I haven't seen any user reports that line up with that one.
On Sat, 5 Oct 2024 at 21:33, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Sun, Oct 06, 2024 at 12:30:02AM GMT, Theodore Ts'o wrote: > > > > You may believe that yours is better than anyone else's, but with > > respect, I disagree, at least for my own workflow and use case. And > > if you look at the number of contributors in both Luis and my xfstests > > runners[2][3], I suspect you'll find that we have far more > > contributors in our git repo than your solo effort.... > > Correct me if I'm wrong, but your system isn't available to the > community, and I haven't seen a CI or dashboard for kdevops? > > Believe me, I would love to not be sinking time into this as well, but > we need to standardize on something everyone can use. I really don't think we necessarily need to standardize. Certainly not across completely different subsystems. Maybe filesystem people have something in common, but honestly, even that is rather questionable. Different filesystems have enough different features that you will have different testing needs. And a filesystem tree and an architecture tree (or the networking tree, or whatever) have basically almost _zero_ overlap in testing - apart from the obvious side of just basic build and boot testing. And don't even get me started on drivers, which have a whole different thing and can generally not be tested in some random VM at all. So no. People should *not* try to standardize on something everyone can use. But _everybody_ should participate in the basic build testing (and the basic boot testing we have, even if it probably doesn't exercise much of most subsystems). That covers a *lot* of stuff that various domain-specific testing does not (and generally should not). For example, when you do filesystem-specific testing, you very seldom have much issues with different compilers or architectures. Sure, there can be compiler version issues that affect behavior, but let's be honest: it's very very rare. And yes, there are big-endian machines and the whole 32-bit vs 64-bit thing, and that can certainly affect your filesystem testing, but I would expect it to be a fairly rare and secondary thing for you to worry about when you try to stress your filesystem for correctness. But build and boot testing? All those random configs, all those odd architectures, and all those odd compilers *do* affect build testing. So you as a filesystem maintainer should *not* generally strive to do your own basic build test, but very much participate in the generic build test that is being done by various bots (not just on linux-next, but things like the 0day bot on various patch series posted to the list etc). End result: one size does not fit all. But I get unhappy when I see some subsystem that doesn't seem to participate in what I consider the absolute bare minimum. Btw, there are other ways to make me less unhappy. For example, a couple of years ago, we had a string of issues with the networking tree. Not because there was any particular maintenance issue, but because the networking tree is basically one of the biggest subsystems there are, and so bugs just happen more for that simple reason. Random driver issues that got found resolved quickly, but that kept happening in rc releases (or even final releases). And that was *despite* the networking fixes generally having been in linux-next. 
Now, the reason I mention the networking tree is that the one simple thing that made it a lot less stressful was that I asked whether the networking fixes pulls could just come in on Thursday instead of late on Friday or Saturday. That meant that any silly things that the bots picked up on (or good testers picked up on quickly) now had an extra day or two to get resolved. Now, it may be that the string of unfortunate networking issues that caused this policy were entirely just bad luck, and we just haven't had that. But the networking pull still comes in on Thursdays, and we've been doing it that way for four years, and it seems to have worked out well for both sides. I certainly feel a lot better about being able to do the (sometimes fairly sizeable) pull on a Thursday, knowing that if there is some last-minute issue, we can still fix just *that* before the rc or final release. And hey, that's literally just a "this was how we dealt with one particular situation". Not everybody needs to have the same rules, because the exact details will be different. I like doing releases on Sundays, because that way the people who do a fairly normal Mon-Fri week come in to a fresh release (whether rc or not). And people tend to like sending in their "work of the week" to me on Fridays, so I get a lot of pull requests on Friday, and most of the time that works just fine. So the networking tree timing policy ended up working quite well for that, but there's no reason it should be "The Rule" and that everybody should do it. But maybe it would lessen the stress on both sides for bcachefs too if we aimed for that kind of thing? Linus
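The "absolute bare minimum" build testing Linus describes can also be approximated locally before the 0day bot or linux-next ever see a branch. A rough sketch, purely illustrative and not any bot's actual configuration:

```sh
#!/bin/sh
# Repeatedly build random kernel configurations, keeping the .config of any
# failing build so it can be reproduced and bisected later.
# Purely illustrative - the real build bots cover far more configs,
# compilers and architectures than a loop like this.
for i in $(seq 1 20); do
	make O=build-rand randconfig
	if ! make O=build-rand -j"$(nproc)"; then
		cp build-rand/.config "randconfig-failure-$i.config"
	fi
done
```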
On Sun, Oct 06, 2024 at 12:04:45PM GMT, Linus Torvalds wrote: > On Sat, 5 Oct 2024 at 21:33, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > On Sun, Oct 06, 2024 at 12:30:02AM GMT, Theodore Ts'o wrote: > > > > > > You may believe that yours is better than anyone else's, but with > > > respect, I disagree, at least for my own workflow and use case. And > > > if you look at the number of contributors in both Luis and my xfstests > > > runners[2][3], I suspect you'll find that we have far more > > > contributors in our git repo than your solo effort.... > > > > Correct me if I'm wrong, but your system isn't available to the > > community, and I haven't seen a CI or dashboard for kdevops? > > > > Believe me, I would love to not be sinking time into this as well, but > > we need to standardize on something everyone can use. > > I really don't think we necessarily need to standardize. Certainly not > across completely different subsystems. > > Maybe filesystem people have something in common, but honestly, even > that is rather questionable. Different filesystems have enough > different features that you will have different testing needs. > > And a filesystem tree and an architecture tree (or the networking > tree, or whatever) have basically almost _zero_ overlap in testing - > apart from the obvious side of just basic build and boot testing. > > And don't even get me started on drivers, which have a whole different > thing and can generally not be tested in some random VM at all. Drivers are obviously a whole different ballgame, but what I'm after is more - tooling the community can use - some level of common infrastructure, so we're not all rolling our own. "Test infrastructure the community can use" is a big one, because enabling the community and making it easier for people to participate and do real development is where our pipeline of new engineers comes from. Over the past 15 years, I've seen the filesystem community get smaller and older, and that's not a good thing. I've had some good success with giving ktest access to people in the community, who then start using it actively and contributing (small, so far) patches (and interesting, a lot of the new activity is from China) - this means they can do development at a reasonable pace and I don't have to look at their code until it's actually passing all the tests, which is _huge_. And filesystem tests take overnight to run on a single machine, so having something that gets them results back in 20 minutes is also huge. The other thing I'd really like is to take the best of what we've got for testrunner/CI dashboard (and opinions will vary, but of course I like ktest the best) and make it available to other subsystems (mm, block, kselftests) because not everyone has time to roll their own. That takes a lot of facetime - getting to know people's workflows, porting tests - so it hasn't happened as much as I'd like, but it's still an active interest of mine. > So no. People should *not* try to standardize on something everyone can use. > > But _everybody_ should participate in the basic build testing (and the > basic boot testing we have, even if it probably doesn't exercise much > of most subsystems). That covers a *lot* of stuff that various > domain-specific testing does not (and generally should not). > > For example, when you do filesystem-specific testing, you very seldom > have much issues with different compilers or architectures. 
Sure, > there can be compiler version issues that affect behavior, but let's > be honest: it's very very rare. And yes, there are big-endian machines > and the whole 32-bit vs 64-bit thing, and that can certainly affect > your filesystem testing, but I would expect it to be a fairly rare and > secondary thing for you to worry about when you try to stress your > filesystem for correctness. But - a big gap right now is endian /portability/, and that one is a pain to cover with automated tests because you either need access to both big and little endian hardware (at a minimum for creating test images), or you need to run qemu in full-emulation mode, which is pretty unbearably slow. > But build and boot testing? All those random configs, all those odd > architectures, and all those odd compilers *do* affect build testing. > So you as a filesystem maintainer should *not* generally strive to do > your own basic build test, but very much participate in the generic > build test that is being done by various bots (not just on linux-next, > but things like the 0day bot on various patch series posted to the > list etc). > > End result: one size does not fit all. But I get unhappy when I see > some subsystem that doesn't seem to participate in what I consider the > absolute bare minimum. So the big issue for me has been that with the -next/0day pipeline, I have no visibility into when it finishes; which means it has to go onto my mental stack of things to watch for and becomes yet another thing to pipeline, and the more I have to pipeline the more I lose track of things. (Seriously: when I am constantly tracking 5 different bug reports and talking to 5 different users, every additional bit of mental state I have to remember is death by a thousand cuts). Which would all be solved with a dashboard - which is why adding the build testing to ktest (or ideally, stealing _all_ the 0day tests for ktest) is becoming a bigger and bigger priority. > Btw, there are other ways to make me less unhappy. For example, a > couple of years ago, we had a string of issues with the networking > tree. Not because there was any particular maintenance issue, but > because the networking tree is basically one of the biggest subsystems > there are, and so bugs just happen more for that simple reason. Random > driver issues that got found resolved quickly, but that kept happening > in rc releases (or even final releases). > > And that was *despite* the networking fixes generally having been in linux-next. Yeah, same thing has been going on in filesystem land, which is why we now have fs-next that we're supposed to be targeting our testing automation at. That one will likely come slower for me, because I need to clear out a bunch of CI failing tests before I'll want to look at that, but it's on my radar. > Now, the reason I mention the networking tree is that the one simple > thing that made it a lot less stressful was that I asked whether the > networking fixes pulls could just come in on Thursday instead of late > on Friday or Saturday. That meant that any silly things that the bots > picked up on (or good testers picked up on quickly) now had an extra > day or two to get resolved. Ok, if fixes coming in on Saturday is an issue for you that's something I can absolutely change. The only _critical_ one for rc2 was the __wait_on_freeing_inode() fix (which did come in late), the rest could've waited until Monday. 
> Now, it may be that the string of unfortunate networking issues that > caused this policy were entirely just bad luck, and we just haven't > had that. But the networking pull still comes in on Thursdays, and > we've been doing it that way for four years, and it seems to have > worked out well for both sides. I certainly feel a lot better about > being able to do the (sometimes fairly sizeable) pull on a Thursday, > knowing that if there is some last-minute issue, we can still fix just > *that* before the rc or final release. > > And hey, that's literally just a "this was how we dealt with one > particular situation". Not everybody needs to have the same rules, > because the exact details will be different. I like doing releases on > Sundays, because that way the people who do a fairly normal Mon-Fri > week come in to a fresh release (whether rc or not). And people tend > to like sending in their "work of the week" to me on Fridays, so I get > a lot of pull requests on Friday, and most of the time that works just > fine. > > So the networking tree timing policy ended up working quite well for > that, but there's no reason it should be "The Rule" and that everybody > should do it. But maybe it would lessen the stress on both sides for > bcachefs too if we aimed for that kind of thing? Yeah, that sounds like the plan then.
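On the endian-portability gap mentioned above: without big-endian hardware, the usual fallback is exactly the slow path Kent describes - a cross-built big-endian kernel booted under full system emulation. A rough sketch only, with the file names as placeholders and qemu's pseries machine as an assumption:

```sh
# Cross-build a big-endian ppc64 kernel (assumes the gcc-powerpc64-linux-gnu
# cross toolchain; enable the filesystem under test in the config first) ...
make ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- O=build-ppc64 ppc64_defconfig
make ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- O=build-ppc64 -j"$(nproc)" vmlinux

# ... then boot it under full emulation with a test initramfs (placeholder
# file name) that runs the filesystem tests. TCG emulation is slow, which
# is exactly the pain point described above.
qemu-system-ppc64 -M pseries -m 2G -nographic \
	-kernel build-ppc64/vmlinux \
	-initrd test-initramfs.cpio.gz \
	-append "console=hvc0"
```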
On Oct 7, 2024, at 03:29, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Sun, Oct 06, 2024 at 12:04:45PM GMT, Linus Torvalds wrote: >> On Sat, 5 Oct 2024 at 21:33, Kent Overstreet <kent.overstreet@linux.dev> wrote: >>> >>> On Sun, Oct 06, 2024 at 12:30:02AM GMT, Theodore Ts'o wrote: >>>> >>>> You may believe that yours is better than anyone else's, but with >>>> respect, I disagree, at least for my own workflow and use case. And >>>> if you look at the number of contributors in both Luis and my xfstests >>>> runners[2][3], I suspect you'll find that we have far more >>>> contributors in our git repo than your solo effort.... >>> >>> Correct me if I'm wrong, but your system isn't available to the >>> community, and I haven't seen a CI or dashboard for kdevops? >>> >>> Believe me, I would love to not be sinking time into this as well, but >>> we need to standardize on something everyone can use. >> >> I really don't think we necessarily need to standardize. Certainly not >> across completely different subsystems. >> >> Maybe filesystem people have something in common, but honestly, even >> that is rather questionable. Different filesystems have enough >> different features that you will have different testing needs. >> >> And a filesystem tree and an architecture tree (or the networking >> tree, or whatever) have basically almost _zero_ overlap in testing - >> apart from the obvious side of just basic build and boot testing. >> >> And don't even get me started on drivers, which have a whole different >> thing and can generally not be tested in some random VM at all. > > Drivers are obviously a whole different ballgame, but what I'm after is > more > - tooling the community can use > - some level of common infrastructure, so we're not all rolling our own. > > "Test infrastructure the community can use" is a big one, because > enabling the community and making it easier for people to participate > and do real development is where our pipeline of new engineers comes > from. Yeah, the CI is really helpful, at least for those who want to get involved in the development of bcachefs. As a new comer, I’m not at all interested in setting up a separate testing environment at the very beginning, which might be time-consuming and costly. > > Over the past 15 years, I've seen the filesystem community get smaller > and older, and that's not a good thing. I've had some good success with > giving ktest access to people in the community, who then start using it > actively and contributing (small, so far) patches (and interesting, a > lot of the new activity is from China) - this means they can do > development at a reasonable pace and I don't have to look at their code > until it's actually passing all the tests, which is _huge_. > > And filesystem tests take overnight to run on a single machine, so > having something that gets them results back in 20 minutes is also huge. Exactly, I can verify some ideas very quickly with the help of the CI. So, a big thank you for all the effort you've put into it! > > The other thing I'd really like is to take the best of what we've got > for testrunner/CI dashboard (and opinions will vary, but of course I > like ktest the best) and make it available to other subsystems (mm, > block, kselftests) because not everyone has time to roll their own. > > That takes a lot of facetime - getting to know people's workflows, > porting tests - so it hasn't happened as much as I'd like, but it's > still an active interest of mine. > >> So no. 
People should *not* try to standardize on something everyone can use. >> >> But _everybody_ should participate in the basic build testing (and the >> basic boot testing we have, even if it probably doesn't exercise much >> of most subsystems). That covers a *lot* of stuff that various >> domain-specific testing does not (and generally should not). >> >> For example, when you do filesystem-specific testing, you very seldom >> have much issues with different compilers or architectures. Sure, >> there can be compiler version issues that affect behavior, but let's >> be honest: it's very very rare. And yes, there are big-endian machines >> and the whole 32-bit vs 64-bit thing, and that can certainly affect >> your filesystem testing, but I would expect it to be a fairly rare and >> secondary thing for you to worry about when you try to stress your >> filesystem for correctness. > > But - a big gap right now is endian /portability/, and that one is a > pain to cover with automated tests because you either need access to > both big and little endian hardware (at a minumm for creating test > images), or you need to run qemu in full-emulation mode, which is pretty > unbearably slow. > >> But build and boot testing? All those random configs, all those odd >> architectures, and all those odd compilers *do* affect build testing. >> So you as a filesystem maintainer should *not* generally strive to do >> your own basic build test, but very much participate in the generic >> build test that is being done by various bots (not just on linux-next, >> but things like the 0day bot on various patch series posted to the >> list etc). >> >> End result: one size does not fit all. But I get unhappy when I see >> some subsystem that doesn't seem to participate in what I consider the >> absolute bare minimum. > > So the big issue for me has been that with the -next/0day pipeline, I > have no visibility into when it finishes; which means it has to go onto > my mental stack of things to watch for and becomes yet another thing to > pipeline, and the more I have to pipeline the more I lose track of > things. > > (Seriously: when I am constantly tracking 5 different bug reports and > talking to 5 different users, every additional bit of mental state I > have to remember is death by a thousand cuts). > > Which would all be solved with a dashboard - which is why adding the > build testing to ktest (or ideally, stealing _all_ the 0day tests for > ktest) is becoming a bigger and bigger priority. > >> Btw, there are other ways to make me less unhappy. For example, a >> couple of years ago, we had a string of issues with the networking >> tree. Not because there was any particular maintenance issue, but >> because the networking tree is basically one of the biggest subsystems >> there are, and so bugs just happen more for that simple reason. Random >> driver issues that got found resolved quickly, but that kept happening >> in rc releases (or even final releases). >> >> And that was *despite* the networking fixes generally having been in linux-next. > > Yeah, same thing has been going on in filesystem land, which is why we now > have fs-next that we're supposed to be targeting our testing automation > at. > > That one will likely come slower for me, because I need to clear out a > bunch of CI failing tests before I'll want to look at that, but it's on > my radar.
> >> Now, the reason I mention the networking tree is that the one simple >> thing that made it a lot less stressful was that I asked whether the >> networking fixes pulls could just come in on Thursday instead of late >> on Friday or Saturday. That meant that any silly things that the bots >> picked up on (or good testers picked up on quickly) now had an extra >> day or two to get resolved. > > Ok, if fixes coming in on Saturday is an issue for you that's something > I can absolutely change. The only _critical_ one for rc2 was the > __wait_for_freeing_inode() fix (which did come in late), the rest > could've waited until Monday. > >> Now, it may be that the string of unfortunate networking issues that >> caused this policy were entirely just bad luck, and we just haven't >> had that. But the networking pull still comes in on Thursdays, and >> we've been doing it that way for four years, and it seems to have >> worked out well for both sides. I certainly feel a lot better about >> being able to do the (sometimes fairly sizeable) pull on a Thursday, >> knowing that if there is some last-minute issue, we can still fix just >> *that* before the rc or final release. >> >> And hey, that's literally just a "this was how we dealt with one >> particular situation". Not everybody needs to have the same rules, >> because the exact details will be different. I like doing releases on >> Sundays, because that way the people who do a fairly normal Mon-Fri >> week come in to a fresh release (whether rc or not). And people tend >> to like sending in their "work of the week" to me on Fridays, so I get >> a lot of pull requests on Friday, and most of the time that works just >> fine. >> >> So the networking tree timing policy ended up working quite well for >> that, but there's no reason it should be "The Rule" and that everybody >> should do it. But maybe it would lessen the stress on both sides for >> bcachefs too if we aimed for that kind of thing? > > Yeah, that sounds like the plan then.
On Sat, Oct 05, 2024 at 06:54:19PM -0400, Kent Overstreet wrote: > On Sat, Oct 05, 2024 at 03:34:56PM GMT, Linus Torvalds wrote: > > On Sat, 5 Oct 2024 at 11:35, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > > > Several more filesystems repaired, thank you to the users who have been > > > providing testing. The snapshots + unlinked fixes on top of this are > > > posted here: > > > > I'm getting really fed up here Kent. > > > > These have commit times from last night. Which makes me wonder how > > much testing they got. > > The /commit/ dates are from last night, because I polish up commit > messages and reorder until the last might (I always push smaller fixes > up front and fixes that are likely to need rework to the back). > > The vast majority of those fixes are all ~2 weeks old. > > > And before you start whining - again - about how you are fixing bugs, > > let me remind you about the build failures you had on big-endian > > machines because your patches had gotten ZERO testing outside your > > tree. > > No, there simply aren't that many people running big endian. I have > users building and running my trees on a daily basis. If I push > something broken before I go to bed I have bug reports waiting for me > _the next morning_ when I wake up. > > > That was just last week, and I'm getting the strong feeling that > > absolutely nothing was learnt from the experience. > > > > I have pulled this, but I searched for a couple of the commit messages > > on the lists, and found *nothing* (ok, I found your pull request, > > which obviously mentioned the first line of the commit messages). > > > > I'm seriously thinking about just stopping pulling from you, because I > > simply don't see you improving on your model. If you want to have an > > experimental tree, you can damn well have one outside the mainline > > kernel. I've told you before, and nothing seems to really make you > > understand. > > At this point, it's honestly debatable whether the experimental label > should apply. I'm getting bug reports that talk about production use and > working on metadata dumps where the superblock indicates the filesystem > has been in continuous use for years. > > And many, many people talking about how even at this relatively early > point it doesn't fall over like btrfs does. > I tend to ignore these kind of emails, it's been a decade and weirdly the file system development community likes to use btrfs as a punching bag. I honestly don't care what anybody else thinks, but I've gotten feedback from others in the community that they wish I'd say something when somebody says things so patently false. So I'm going to respond exactly once to this, and it'll be me satisfying my quota for this kind of thing for the rest of the year. Btrfs is used by default in the desktop spin of Fedora, openSuse, and maybe some others. Our development community is actively plugged into those places, we drop everything to help when issues arise there. Btrfs is the foundation of the Meta fleet. We rely on its capabilities and, most importantly of all, its stability for our infrastructure. Is it perfect? Absolutely not. You will never hear me say that. I have often, and publicly, said that Meta also uses XFS in our database workloads, because it simply is just better than Btrfs at that. Yes, XFS is better at Btrfs at some things. I'm not afraid to admit that, because my personal worth is not tied to the software projects I'm involved in. 
Dave Chinner, Darrick Wong, Christoph Hellwig, Eric Sandeen, and many others have done a fantastic job with XFS. I have a lot of respect for them and the work they've done. I've learned a lot from them. Ext4 is better than Btrfs in a lot of those same things. Ted Ts'o, Andreas Dilger, Jan Kara, and many others have done a fantastic job with ext4. I have learned a lot from all of these developers, all of these file systems, and many others in this community. Bcachefs is doing new and interesting things. There are many things that I see you do that I wish we'd had the foresight to know were going to be a problem with Btrfs and done differently. You, along with the wider file system community, have a lot of the same ideals, same practices, and same desire to do your absolute best work. That is an admirable trait, one that we all share. But dragging other people and their projects down is not the sort of behavior that I think should have a place in this community. This is not the kind of community I want to exist in. You are not the only person who does this, but you are the most vocal and constant example of it. Just like I tell my kids, just because somebody else is doing something wrong doesn't mean you get to do it too. We can improve our own projects, we can collaborate, and we can support each other's work. Christian and I tag-teamed the mount namespace work. Amir and I tag-teamed the Fanotify HSM work. Those two projects are the most fun and rewarding experiences I've had in the last few years. This work is way more fun when we can work together, and the relationships I've built in this community through this collaboration around solving problems are my most cherished professional relationships. Or we can keep doing this, randomly throwing mud at each other, pissing each other off, making ourselves into unhireable pariahs. I've made my decision, and honestly I think it's better. But what the fuck do I know, I work on btrfs. Thanks, Josef
On Sun, Oct 06, 2024 at 03:29:51PM -0400, Kent Overstreet wrote: > But - a big gap right now is endian /portability/, and that one is a > pain to cover with automated tests because you either need access to > both big and little endian hardware (at a minumm for creating test > images), or you need to run qemu in full-emulation mode, which is pretty > unbearably slow. It's really not that bad, at least for my use cases: https://www.wireguard.com/build-status/ This thing sends pings to my cellphone too. You can poke around in tools/testing/selftests/wireguard/qemu/ if you're curious. It's kinda gnarly but has proven very very flexible to hack up for whatever additional testing I need. For example, I've been using it for some of my recent non-wireguard work here: https://git.zx2c4.com/linux-rng/commit/?h=jd/vdso-test-harness Taking this straight-up probably won't fit for your filesystem work, but maybe it can act as a bit of motivation that automated qemu'ing can generally work. It has definitely caught a lot of silly bugs during development time. If for your cases, this winds up taking 3 days to run instead of the minutes mine needs, so be it, that's a small workflow adjustment thing. You might not get the same dopamine feedback loop of seeing your changes in action and deployed to users _now_, but maybe delaying the gratification a bit will be good anyway. Jason
Kent Overstreet - 06.10.24, 19:18:00 MESZ: > > I still do have a BCacheFS on my laptop for testing, but meanwhile I > > wonder whether some of the crazy kernel regressions I have seen with > > the last few kernels were actually related to having mounted that > > BCacheFS test filesystem. I am tempted to replace the BCacheFS with a > > BTRFS just to find out. > > I think you should be looking elsewhere - there have been zero reports > of random crashes or anything like what you're describing. Even in > syzbot testing we've been pretty free from the kind of memory safety > issues that would cause random crashes. Okay. From what I saw of the backtrace, I am not sure it is a memory safety bug. It could be a deadlock involving work queues. Anyway… as you can read below, it is not BCacheFS related. But I understand too little about all of this to say for sure. > The closest bugs to what you're describing would be the > __wait_on_freeing_inode() deadlock in 6.12-rc1, and the LZ4HC crash that > I've yet to triage - but you specifically have to be using lz4:15 > compression to hit that path. Well, a crash on reboot happened again, without BCacheFS. I wrote that I would report back either way. I think I will wait and see whether this goes away with a newer kernel, as some of the other regressions I saw before did. It was not in all of the 6.11 series of Debian kernels, just in the most recent one. In case it doesn't go away, I may open a kernel bug report with Debian directly. For extra safety I did a memory test with memtest86+ 7.00. Zero errors. As for the other regressions, I cannot tell yet whether they have gone away. So far they have not occurred again. But so far it looks like replacing BCacheFS with BTRFS does not make a difference. And I wanted to report that back. Best,
On Mon, Oct 07, 2024 at 05:01:55PM GMT, Jason A. Donenfeld wrote: > On Sun, Oct 06, 2024 at 03:29:51PM -0400, Kent Overstreet wrote: > > But - a big gap right now is endian /portability/, and that one is a > > pain to cover with automated tests because you either need access to > > both big and little endian hardware (at a minumm for creating test > > images), or you need to run qemu in full-emulation mode, which is pretty > > unbearably slow. > > It's really not that bad, at least for my use cases: > > https://www.wireguard.com/build-status/ > > This thing sends pings to my cellphone too. You can poke around in > tools/testing/selftests/wireguard/qemu/ if you're curious. It's kinda > gnarly but has proven very very flexible to hack up for whatever > additional testing I need. For example, I've been using it for some of > my recent non-wireguard work here: https://git.zx2c4.com/linux-rng/commit/?h=jd/vdso-test-harness > > Taking this straight-up probably won't fit for your filesystem work, but > maybe it can act as a bit of motivation that automated qemu'ing can > generally work. It has definitely caught a lot of silly bugs during > development time. I have all the qemu automation: https://evilpiepirate.org/git/ktest.git/ That's what I use for normal interactive development, i.e. I run something like build-test-kernel -I ~/ktest/tests/fs/bcachefs/replication.ktest rereplicate which builds a kernel, launches a VM and starts running a test; test output on stdout, I can ssh in, ctrl-c kills it like any other test. And those same tests are run automatically by my CI, which watches various git branches and produces results here: https://evilpiepirate.org/~testdashboard/ci?user=kmo&branch=bcachefs-testing (Why yes, that is a lot of failing tests still.) I'm giving out accounts on this to anyone in the community doing kernel development; we've got fstests wrappers for every local filesystem, plus nfs, plus assorted other tests. Can always use more hardware if anyone wants to provide more machines.
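For readers who want to try the workflow Kent describes above, here is a minimal sketch built only from the repository URL and the command quoted in this thread; the checkout location, the kernel-tree path, and the placement of build-test-kernel at the top of the checkout are assumptions, not anything specified in the thread.

```shell
# Minimal sketch of the interactive ktest flow described above.
# Assumptions: ktest cloned to ~/ktest, kernel tree in ~/linux,
# and build-test-kernel at the top level of the ktest checkout.
git clone https://evilpiepirate.org/git/ktest.git ~/ktest

cd ~/linux
# Builds a kernel, launches a VM, and runs the named subtest;
# test output streams to stdout, and ctrl-c tears the VM down.
~/ktest/build-test-kernel -I ~/ktest/tests/fs/bcachefs/replication.ktest rereplicate
```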
On Mon, Oct 07, 2024 at 10:58:47AM GMT, Josef Bacik wrote: > I tend to ignore these kind of emails, it's been a decade and weirdly the file > system development community likes to use btrfs as a punching bag. I honestly > don't care what anybody else thinks, but I've gotten feedback from others in the > community that they wish I'd say something when somebody says things so patently > false. So I'm going to respond exactly once to this, and it'll be me satisfying > my quota for this kind of thing for the rest of the year. > > Btrfs is used by default in the desktop spin of Fedora, openSuse, and maybe some > others. Our development community is actively plugged into those places, we > drop everything to help when issues arise there. Btrfs is the foundation of the > Meta fleet. We rely on its capabilities and, most importantly of all, its > stability for our infrastructure. > > Is it perfect? Absolutely not. You will never hear me say that. I have often, > and publicly, said that Meta also uses XFS in our database workloads, because it > simply is just better than Btrfs at that. > > Yes, XFS is better at Btrfs at some things. I'm not afraid to admit that, > because my personal worth is not tied to the software projects I'm involved in. Josef, I've got to be honest with you: if, 10 years in, a filesystem still has a lot of user reports of it wedging itself that clearly aren't being addressed, that's a pretty epic fail, and that really is the main reason why I'm here. The #1 priority in filesystem land has to be robustness. Not features, not performance; it has to simply work. The bar for "acceptably good" is really, really high when you're responsible for users' data. In the rest of the kernel, if you screw up, generally the worst that happens is you crash the machine - users are annoyed, whatever they were doing gets interrupted, but nothing drastically bad happens. In filesystem land, fairly minor screwups can lead to the entire machine being down for extended periods of time if the filesystem has wedged itself, to really involved repair procedures that users _should not_ have to do, or, worse, to real data loss.
And you need to be thinking about the trust that users are placing in you; that's people's _lives_ they're storing on their machines. So no, based on the feedback I still _regularly_ get I don't think btrfs hit an acceptable level of reliability, and if it's taking this long I doubt it will. "Mostly works" is just not good enough. To be fair, bcachefs isn't "good enough" yet either, I'm still getting bug reports where bcachefs wedges itself too. But I've also been pretty explicit about that, and I'm not taking the experimental label off until those reports have stopped and we've addressed _every_ known way it can wedge itself and we've torture tested the absolute crap out of repair. And I think you've set the bar too low, by just accepting that btrfs isn't going to be as good as xfs in some situations. I don't think there's any reason a modern COW filesystem has to be crappier in _any_ respect than ext4/xfs. It's just a matter of prioritizing the essentials and working at it until it's done.
On Mon, Oct 07, 2024 at 03:59:17PM -0400, Kent Overstreet wrote: > On Mon, Oct 07, 2024 at 05:01:55PM GMT, Jason A. Donenfeld wrote: > > On Sun, Oct 06, 2024 at 03:29:51PM -0400, Kent Overstreet wrote: > > > But - a big gap right now is endian /portability/, and that one is a > > > pain to cover with automated tests because you either need access to > > > both big and little endian hardware (at a minumm for creating test > > > images), or you need to run qemu in full-emulation mode, which is pretty > > > unbearably slow. > > > > It's really not that bad, at least for my use cases: > > > > https://www.wireguard.com/build-status/ > > > > This thing sends pings to my cellphone too. You can poke around in > > tools/testing/selftests/wireguard/qemu/ if you're curious. It's kinda > > gnarly but has proven very very flexible to hack up for whatever > > additional testing I need. For example, I've been using it for some of > > my recent non-wireguard work here: https://git.zx2c4.com/linux-rng/commit/?h=jd/vdso-test-harness > > > > Taking this straight-up probably won't fit for your filesystem work, but > > maybe it can act as a bit of motivation that automated qemu'ing can > > generally work. It has definitely caught a lot of silly bugs during > > development time. > > I have all the qemu automation: > https://evilpiepirate.org/git/ktest.git/ Neat. I suppose you can try to hook up all the other archs to run in TCG there, and then you'll be able to test big endian and whatever other weird issues crop up.
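To make the TCG idea concrete, here is a rough sketch of a big-endian boot smoke test under full emulation. It is not taken from either test harness discussed above; the cross-toolchain prefix, the timeout, and the panic-string check are all assumptions, and a real fstests run would still need a root image and test devices on top of this.

```shell
# Sketch: cross-build a big-endian ppc64 kernel and boot it under qemu TCG.
# Assumes a powerpc64-linux-gnu- cross toolchain is installed; full emulation
# is slow, but fast enough to catch endian and build breakage automatically.
make ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- ppc64_defconfig
make ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu- -j"$(nproc)" vmlinux

# With no root device, a successful boot ends at the "unable to mount root"
# panic; panic=-1 plus -no-reboot makes qemu exit so a script can check it.
timeout 15m qemu-system-ppc64 -M pseries -nographic -no-reboot \
    -kernel vmlinux -append "console=hvc0 panic=-1" | tee boot.log
grep -q "Unable to mount root" boot.log && echo "boot smoke test passed"
```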
On Sun, Oct 6, 2024 at 9:29 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > On Sun, Oct 06, 2024 at 12:04:45PM GMT, Linus Torvalds wrote: > > But build and boot testing? All those random configs, all those odd > > architectures, and all those odd compilers *do* affect build testing. > > So you as a filesystem maintainer should *not* generally strive to do > > your own basic build test, but very much participate in the generic > > build test that is being done by various bots (not just on linux-next, > > but things like the 0day bot on various patch series posted to the > > list etc). > > > > End result: one size does not fit all. But I get unhappy when I see > > some subsystem that doesn't seem to participate in what I consider the > > absolute bare minimum. > > So the big issue for me has been that with the -next/0day pipeline, I > have no visibility into when it finishes; which means it has to go onto > my mental stack of things to watch for and becomes yet another thing to > pipeline, and the more I have to pipeline the more I lose track of > things. FWIW, my understanding is that linux-next is not just infrastructure for CI bots. For example, there is also tooling based on -next that doesn't have such a thing as "done with processing" - my understanding is that syzkaller (https://syzkaller.appspot.com/upstream) has instances that fuzz linux-next ("ci-upstream-linux-next-kasan-gce-root").
On Sun, Oct 06, 2024 at 12:33:51AM -0400, Kent Overstreet wrote: > > Correct me if I'm wrong, but your system isn't available to the > community, and I haven't seen a CI or dashboard for kdevops? It's up on github for anyone to download, and I've provided pre-built test appliances so people don't have to download xfstests and all of its dependencies and build everything from scratch. (That's been automated, of course, but the build infrastructure is set up to use a Debian build chroot, and with the precompiled test appliances, you can use my test runner on pretty much any Linux distribution; it will even work on MacOS if you have qemu built from macports, although for now you have to build the kernel on a Linux distro using a Parallels VM[1].) I'll note that IMHO making testing resources available to the community isn't really the bottleneck. Using cloud resources, especially if you spin up the VM's only when you need to run the tests, and shut them down once the test is complete, which gce-xfstests does, is actually quite cheap. At retail prices, running a dozen ext4 file system configurations against xfstests's "auto" group will take about 24 hours of VM time, and including the cost of the block devices, costs just under two dollars USD. Because the tests are run in parallel, the total wall clock time to run all of the tests is about two and a half hours. Running the "quick" group on a single file system configuration costs pennies. So the $300 of free GCE credits will actually get someone pretty far! No, the bottleneck is having someone knowledgeable enough to interpret the test results and then finding the root cause of the failures. This is one of the reasons why I haven't stressed all that much about dashboards. Dashboards are only useful if the right person(s) is looking at them. That's why I've been much more interested in making it stupidly easy to run tests on someone's local resources, e.g.: https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md In fact, for most people, the entry point that I envision as being most interesting is that they download kvm-xfstests and follow the instructions in the quickstart, so they can run "kvm-xfstests smoke" before sending me an ext4 patch. Running the smoke test only takes 15 minutes using qemu, and it's much more convenient for them to run that on their local machine than to trigger the test on some remote machine, whether it's in the cloud or someone's remote test server. In any case, that's why I haven't been interested in working with your test infrastructure; I have my own, and in my opinion, my approach is the better one to make available to the community, and so when I have time to improve it, I'd much rather work on {kvm,gce,android}-xfstests. Cheers, - Ted [1] Figuring out how to coerce the MacOS toolchain to build the Linux kernel would be cool if anyone ever figures it out. However, I *have* done kernel development using a Macbook Air M2 while on a cruise ship with limited internet access, building the kernel using a Parallels VM running Debian testing, and then using qemu from MacPorts to avoid the double virtualization performance penalty to run xfstests to test the freshly-built arm64 kernel, using my xfstests runner -- and all of this is available on github for anyone to use.
On Tue, Oct 08, 2024 at 10:51:39PM GMT, Theodore Ts'o wrote: > On Sun, Oct 06, 2024 at 12:33:51AM -0400, Kent Overstreet wrote: > > > > Correct me if I'm wrong, but your system isn't available to the > > community, and I haven't seen a CI or dashboard for kdevops? > > It's up on github for anyone to download, and I've provided pre-built > test appliance so people don't have to have downloaded xfstests and > all of its dependencies and build it from scratch. (That's been > automated, of course, but the build infrastructure is setup to use a > Debian build chroot, and with the precompiled test appliances, you can > use my test runner on pretty much any Linux distribution; it will even > work on MacOS if you have qemu built from macports, although for now > you have to build the kernel on Linux distro using Parallels VM[1].) How many steps are required, start to finish, to test a git branch and get the results? Compare that to my setup, where I give you an account, we set up the config file that lists tests to run and git branches to test, and then results show up in the dashboard. > I'll note that IMHO making testing resources available to the > community isn't really the bottleneck. Using cloud resources, > especially if you spin up the VM's only when you need to run the > tests, and shut them down once the test is complete, which > gce-xfstests does, is actually quite cheap. At retail prices, running > a dozen ext4 file system configurations against xfstests's "auto" > group will take about 24 hours of VM time, and including the cost of > the block devices, costs just under two dollars USD. Because the > tests are run in parallel, the total wall clock time to run all of the > tests is about two and a half hours. Running the "quick" group on a > single file system configuration costs pennies. So the $300 of free > GCE credits will actually get someone pretty far! That's the same argument that I've been making - machine resources are cheap these days. And using bare metal machines significantly simplifies the backend (watchdogs, catching full kernel and test output, etc.). > No, the bottleneck is having someone knowledgeable enough to interpret > the test results and then finding the root cause of the failures. > This is one of the reasons why I haven't stressed all that much about > dashboards. Dashboards are only useful if the right person(s) is > looking at them. That's why I've been much more interested in making > it stupidly easy to run tests on someone's local resources, e.g.: > > https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md Yes, it needs to be trivial to run the same test locally that gets run by the automated infrastructure, I've got that as well. But dashboards are important, as well. And the git log based dashboard I've got drastically reduces time spent manually bisecting. > In fact, for most people, the entry point that I envision as being > most interesting is that they download the kvm-xfstests, and following > the instructions in the quickstart, so they can run "kvm-xfstests > smoke" before sending me an ext4 patch. Running the smoke test only > takes 15 minutes using qemu, and it's much more convenient for them to > run that on their local machine than to trigger the test on some > remote machine, whether it's in the cloud or someone's remote test > server. 
> > In any case, that's why I haven't been interesting in working with > your test infrastructure; I have my own, and in my opinion, my > approach is the better one to make available to the community, and so > when I have time to improve it, I'd much rather work on > {kvm,gce,android}-xfstests. Well, my setup also isn't tied to xfstests, and it's fairly trivial to wrap all of our other (mm, block) tests. But like I said before, I don't particularly care which one wins, as long as we're pushing forward with something.
On Wed, Oct 09, 2024 at 12:17:35AM -0400, Kent Overstreet wrote: > How many steps are required, start to finish, to test a git branch and > get the results? See the quickstart doc. The TL;DR is (1) do the git clone, (2) "make ; make install" (this is just to set up the paths in the shell scripts and then copying them to your ~/bin directory, so this takes a second or so), and then (3) "install-kconfig ; kbuild ; kvm-xfstests smoke" in your kernel tree. > But dashboards are important, as well. And the git log based dashboard > I've got drastically reduces time spent manually bisecting. gce-xfstests ltm -c ext4/1k generic/750 --repo ext4.git --bisect-bad dev --bisect-good origin With automated bisecting, I don't have to spend any of my personal time; I just wait for the results to show up in my inbox, without needing to refer to any dashboards. :-) > > In any case, that's why I haven't been interesting in working with > > your test infrastructure; I have my own, and in my opinion, my > > approach is the better one to make available to the community, and so > > when I have time to improve it, I'd much rather work on > > {kvm,gce,android}-xfstests. > > Well, my setup also isn't tied to xfstests, and it's fairly trivial to > wrap all of our other (mm, block) tests. Neither is mine; the name {kvm,gce,qemu,android}-xfstests is the same for historical reasons. I have blktests, ltp, stress-ng and the Phoronix Test Suite wired up (although comparing against historical baselines with PTS is a bit manual at the moment). > But like I said before, I don't particularly care which one wins, as > long as we're pushing forward with something. I'd say that in the file system development community there has been a huge amount of interest in testing, because we all have a general consensus that testing is super important[1]. Most of us decided that the "There Can Be Only One" from the Highlander Movie is just not happening, because everyone's test infrastructure is optimized for their particular workflow, just as there's a really good reason why there are 75+ file systems in Linux, and a half-dozen or so very popular general-purpose file systems. And that's a good thing. Cheers, - Ted [1] https://docs.google.com/presentation/d/14MKWxzEDZ-JwNh0zNUvMbQa5ZyArZFdblTcF5fUa7Ss/edit#slide=id.g1635d98056_0_45
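Spelled out as commands, the three steps Ted lists above look roughly like this. It is only a sketch assembled from the commands quoted in this thread and the linked quickstart; the kernel-tree path is a placeholder, and the linked doc covers the actual prerequisites.

```shell
# (1) fetch xfstests-bld and (2) install the wrapper scripts into ~/bin
git clone https://github.com/tytso/xfstests-bld.git
cd xfstests-bld
make && make install

# (3) from your own kernel tree: set up a test config, build the kernel,
#     and run the ~15 minute smoke test in a local qemu/KVM guest
cd ~/linux
install-kconfig
kbuild
kvm-xfstests smoke
```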
On Wed Oct 9, 2024 at 5:51 AM CEST, Theodore Ts'o wrote: > On Sun, Oct 06, 2024 at 12:33:51AM -0400, Kent Overstreet wrote: >> >> Correct me if I'm wrong, but your system isn't available to the >> community, and I haven't seen a CI or dashboard for kdevops? > > It's up on github for anyone to download, and I've provided pre-built > test appliance so people don't have to have downloaded xfstests and > all of its dependencies and build it from scratch. (That's been > automated, of course, but the build infrastructure is setup to use a > Debian build chroot, and with the precompiled test appliances, you can > use my test runner on pretty much any Linux distribution; it will even > work on MacOS if you have qemu built from macports, although for now > you have to build the kernel on Linux distro using Parallels VM[1].) > > I'll note that IMHO making testing resources available to the > community isn't really the bottleneck. Using cloud resources, > especially if you spin up the VM's only when you need to run the > tests, and shut them down once the test is complete, which > gce-xfstests does, is actually quite cheap. At retail prices, running > a dozen ext4 file system configurations against xfstests's "auto" > group will take about 24 hours of VM time, and including the cost of > the block devices, costs just under two dollars USD. Because the > tests are run in parallel, the total wall clock time to run all of the > tests is about two and a half hours. Running the "quick" group on a > single file system configuration costs pennies. So the $300 of free > GCE credits will actually get someone pretty far! > > No, the bottleneck is having someone knowledgeable enough to interpret > the test results and then finding the root cause of the failures. > This is one of the reasons why I haven't stressed all that much about > dashboards. Dashboards are only useful if the right person(s) is > looking at them. That's why I've been much more interested in making > it stupidly easy to run tests on someone's local resources, e.g.: > > https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md > > In fact, for most people, the entry point that I envision as being > most interesting is that they download the kvm-xfstests, and following > the instructions in the quickstart, so they can run "kvm-xfstests > smoke" before sending me an ext4 patch. Running the smoke test only > takes 15 minutes using qemu, and it's much more convenient for them to > run that on their local machine than to trigger the test on some > remote machine, whether it's in the cloud or someone's remote test > server. > > In any case, that's why I haven't been interesting in working with > your test infrastructure; I have my own, and in my opinion, my > approach is the better one to make available to the community, and so > when I have time to improve it, I'd much rather work on > {kvm,gce,android}-xfstests. > > Cheers, > > - Ted > > > [1] Figuring out how to coerce the MacOS toolchain to build the Linux > kernel would be cool if anyone ever figures it out. However, I *have* Building Linux for arm64 is now supported in macOS. You can find all patch series discussions here [1]. 
In case you want to give this a try, here are the steps:

```shell
# Create a case-sensitive APFS volume named "linux" for the kernel tree
diskutil apfs addVolume /dev/disk<N> "Case-sensitive APFS" linux
```

```shell
# Install the GNU userland tools and LLVM toolchain the build needs
brew install coreutils findutils gnu-sed gnu-tar grep llvm make pkg-config
```

```shell
brew tap bee-headers/bee-headers
brew install bee-headers/bee-headers/bee-headers
```

Initialize the environment with `bee-init`. Repeat with every new shell:

```shell
source bee-init
```

```shell
make LLVM=1 defconfig
make LLVM=1 -j$(nproc)
```

More details about the required setup can be found here [2]. This allows you to build the kernel, boot it with the QEMU -kernel argument, and debug it with lldb.

[1] v3: https://lore.kernel.org/all/20240925-macos-build-support-v3-1-233dda880e60@samsung.com/ v2: https://lore.kernel.org/all/20240906-macos-build-support-v2-0-06beff418848@samsung.com/ v1: https://lore.kernel.org/all/20240807-macos-build-support-v1-0-4cd1ded85694@samsung.com/

[2] https://github.com/bee-headers/homebrew-bee-headers/blob/main/README.md

Daniel

> done kernel development using a Macbook Air M2 while on a cruise ship > with limited internet access, building the kernel using a Parallels VM > running Debian testing, and then using qemu from MacPorts to avoid the > double virtualization performance penalty to run xfstests to test the > freshly-built arm64 kernel, using my xfstests runner -- and all of > this is available on github for anyone to use.
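As a follow-on to the "boot it with the QEMU -kernel argument" step above, a minimal sketch of booting the freshly built arm64 Image: the machine type, CPU flags, and memory size are assumptions, and without a root filesystem the boot stops at the root-mount panic, which is still enough for a boot smoke test.

```shell
# Boot the arm64 Image built above under qemu; the serial console goes to
# the terminal. On Apple silicon, "-accel hvf -cpu host" avoids slow TCG.
qemu-system-aarch64 -M virt -cpu max -m 2G -nographic \
    -kernel arch/arm64/boot/Image \
    -append "console=ttyAMA0"
```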