Message ID | CAJSP0QUk77GViTBgBpfYH-AbAmQ5aUwi0K6UTH9iv=1mVb0Wbw@mail.gmail.com (mailing list archive)
---|---
State | New
Series | Kubernetes gitlab-runner jobs cannot be scheduled
On 3/1/25 07:19, Stefan Hajnoczi wrote:
> Hi,
> On February 26th GitLab CI started failing many jobs because they
> could not be scheduled. I've been unable to merge pull requests
> because the CI is not working.
>
> Here is an example failed job:
> https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

Hi Stefan,

until February 26th the Digital Ocean runners were not enabled; I tried
enabling them (which is what caused the issue) to start gauging how much
credit we would need to be able to move from Azure to DO for CI. I
posted a note on IRC, I'm sorry if you missed that.

> The cache PVC appears to be a manual addition made to the running
> cluster but not committed to qemu.git. I don't understand why the
> problems only started surfacing now. Maybe a recent .gitlab-ci.d/
> change changed how the timeout behaves or maybe the gitlab-runner
> configuration that enables the cache PVC simply wasn't picked up by
> the gitlab-runner Pod until February 26th?

Almost: the cache is not used on Azure, which is why it works.

> In the short term I made a manual edit to the ConfigMap removing
> gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
> are at least running now, although they may take longer due to the
> lack of cache.

Ok, thanks for debugging that. I think what you did is right, and the
caching setup should be tested more on a secondary cluster.

(As to the DO credits numbers, the cost of the k8s cluster is about
$75/month, and since we were granted $2000 in credits we have only
$1100/year to spend on the actual jobs. The plan is to check on the
credits left at the end of March and bring our estimates to DO's open
source program manager).

Paolo

> In the long term maybe we should deploy minio
> (https://github.com/minio/minio) or another Kubernetes S3-like service
> so gitlab-runner can properly use a global cache without ReadWriteOnce
> limitations?
>
> Since I don't know the details of how the Digital Ocean Kubernetes
> cluster was configured for gitlab-runner I don't want to make too many
> changes without your input. Please let me know what you think.
>
> Stefan
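For reference, the distributed cache Stefan suggests would live in the runner's
config.toml rather than being mounted into job pods. A minimal sketch, assuming
a MinIO Service reachable in-cluster at minio.gitlab.svc:9000 and a pre-created
bucket named runner-cache; the endpoint, bucket name, and credentials below are
placeholders, not the project's actual configuration:

    [runners.cache]
      Type = "s3"
      Shared = true                              # all runners share one cache
      [runners.cache.s3]
        ServerAddress = "minio.gitlab.svc:9000"  # placeholder in-cluster MinIO endpoint
        AccessKey = "CACHE_ACCESS_KEY"           # placeholder credentials
        SecretKey = "CACHE_SECRET_KEY"
        BucketName = "runner-cache"              # bucket must be created beforehand
        Insecure = true                          # plain HTTP inside the cluster

With the cache going over HTTP to object storage, no PVC needs to be mounted
into the job pods, so the ReadWriteOnce scheduling constraint no longer applies.
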
On Sat, Mar 1, 2025 at 2:36 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/1/25 07:19, Stefan Hajnoczi wrote:
> > Hi,
> > On February 26th GitLab CI started failing many jobs because they
> > could not be scheduled. I've been unable to merge pull requests
> > because the CI is not working.
> >
> > Here is an example failed job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/9281757413
>
> Hi Stefan,
>
> until February 26th the Digital Ocean runners were not enabled; I tried
> enabling them (which is what caused the issue) to start gauging how much
> credit we would need to be able to move from Azure to DO for CI. I
> posted a note on IRC, I'm sorry if you missed that.
>
> > The cache PVC appears to be a manual addition made to the running
> > cluster but not committed to qemu.git. I don't understand why the
> > problems only started surfacing now. Maybe a recent .gitlab-ci.d/
> > change changed how the timeout behaves or maybe the gitlab-runner
> > configuration that enables the cache PVC simply wasn't picked up by
> > the gitlab-runner Pod until February 26th?
>
> Almost: the cache is not used on Azure, which is why it works.
>
> > In the short term I made a manual edit to the ConfigMap removing
> > gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
> > are at least running now, although they may take longer due to the
> > lack of cache.
>
> Ok, thanks for debugging that. I think what you did is right, and the
> caching setup should be tested more on a secondary cluster.

Glad the change is acceptable and didn't break things more.

> (As to the DO credits numbers, the cost of the k8s cluster is about
> $75/month, and since we were granted $2000 in credits we have only
> $1100/year to spend on the actual jobs. The plan is to check on the
> credits left at the end of March and bring our estimates to DO's open
> source program manager).

This reminds me I received an email asking for feedback regarding
QEMU's Amazon credits. Just wanted to mention they are there if we
need them.

Stefan
On Sat, Mar 1, 2025 at 2:36 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/1/25 07:19, Stefan Hajnoczi wrote:
> > Hi,
> > On February 26th GitLab CI started failing many jobs because they
> > could not be scheduled. I've been unable to merge pull requests
> > because the CI is not working.
> >
> > Here is an example failed job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/9281757413
>
> Hi Stefan,
>
> until February 26th the Digital Ocean runners were not enabled; I tried
> enabling them (which is what caused the issue) to start gauging how much
> credit we would need to be able to move from Azure to DO for CI. I
> posted a note on IRC, I'm sorry if you missed that.

There is a new type of timeout failure:
https://gitlab.com/qemu-project/qemu/-/jobs/9288349332

GitLab says:
"There has been a timeout failure or the job got stuck. Check your
timeout limits or try again"

Duration: 77 minutes 13 seconds
Timeout: 1h (from project)

It ran 17 minutes longer than the job timeout. Any idea?

Stefan
On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> GitLab says:
> "There has been a timeout failure or the job got stuck. Check your
> timeout limits or try again"
>
> Duration: 77 minutes 13 seconds
> Timeout: 1h (from project)
>
> It ran 17 minutes longer than the job timeout.

The job only seems to have run for roughly 15-20 minutes.

I am not sure what's going on, but I have opened a ticket with DO to
request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
droplets). This matches roughly what was available on Azure.

Let me know if you prefer to go back to Azure for the time being.

Paolo
On Mon, Mar 3, 2025 at 5:26 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > GitLab says:
> > "There has been a timeout failure or the job got stuck. Check your
> > timeout limits or try again"
> >
> > Duration: 77 minutes 13 seconds
> > Timeout: 1h (from project)
> >
> > It ran 17 minutes longer than the job timeout.
>
> The job only seems to have run for roughly 15-20 minutes.
>
> I am not sure what's going on, but I have opened a ticket with DO to
> request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
> droplets). This matches roughly what was available on Azure.
>
> Let me know if you prefer to go back to Azure for the time being.

Yes, please. I'm unable to merge pull requests (with a clear
conscience at least) because running CI to completion is taking so
long with many manual retries needed.

Perhaps the timeouts will go away once the droplet size is increased.
It makes sense that running the jobs on different hardware might
require readjusting timeouts.

Thanks,
Stefan
On Mon, Mar 03, 2025 at 07:01:16PM +0800, Stefan Hajnoczi wrote:
> On Mon, Mar 3, 2025 at 5:26 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > GitLab says:
> > > "There has been a timeout failure or the job got stuck. Check your
> > > timeout limits or try again"
> > >
> > > Duration: 77 minutes 13 seconds
> > > Timeout: 1h (from project)
> > >
> > > It ran 17 minutes longer than the job timeout.
> >
> > The job only seems to have run for roughly 15-20 minutes.
> >
> > I am not sure what's going on, but I have opened a ticket with DO to
> > request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
> > droplets). This matches roughly what was available on Azure.
> >
> > Let me know if you prefer to go back to Azure for the time being.
>
> Yes, please. I'm unable to merge pull requests (with a clear
> conscience at least) because running CI to completion is taking so
> long with many manual retries needed.
>
> Perhaps the timeouts will go away once the droplet size is increased.
> It makes sense that running the jobs on different hardware might
> require readjusting timeouts.

It is a bit surprising to see timeouts, as we've fine-tuned our test
timeouts to cope with GitLab's default shared runners, which is what
contributors use when CI runs in a fork. These runners only have
2 vCPUs and 8 GB of RAM, so that's a pretty low resource baseline.

With regards,
Daniel
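One practical note on the droplet-size point: the kubernetes executor only gives
job pods what config.toml requests, so larger droplets help job duration only if
the 0.5-CPU requests in the deployed config (see the diff below) are raised too.
A rough sketch with values picked to at least match the 2 vCPU / 8 GB
shared-runner baseline Daniel mentions; the exact numbers are assumptions, not a
tested configuration for QEMU's CI:

    [runners.kubernetes]
      # Request at least the shared-runner baseline so job duration stays
      # comparable to what the test timeouts were tuned against.
      cpu_request = "2"
      memory_request = "8Gi"
      # Optional caps so a single job cannot monopolize a 16 vCPU / 32 GB droplet.
      cpu_limit = "8"
      memory_limit = "16Gi"
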
--- /tmp/upstream.yaml	2025-03-01 12:47:40.495216401 +0800
+++ /tmp/deployed.yaml	2025-03-01 12:47:38.884216210 +0800
@@ -9,6 +9,7 @@
   [runners.kubernetes]
     poll_timeout = 1200
     image = "ubuntu:20.04"
+    privileged = true
     cpu_request = "0.5"
     service_cpu_request = "0.5"
     helper_cpu_request = "0.25"
@@ -18,5 +19,6 @@
       name = "docker-certs"
       mount_path = "/certs/client"
       medium = "Memory"
-  [runners.kubernetes.node_selector]
-    agentpool = "jobs"
+  [[runners.kubernetes.volumes.pvc]]
+    name = "gitlab-cache-pvc"
+    mount_path = "/cache"