
Kubernetes gitlab-runner jobs cannot be scheduled

Message ID CAJSP0QUk77GViTBgBpfYH-AbAmQ5aUwi0K6UTH9iv=1mVb0Wbw@mail.gmail.com (mailing list archive)
State New
Series Kubernetes gitlab-runner jobs cannot be scheduled

Commit Message

Stefan Hajnoczi March 1, 2025, 6:19 a.m. UTC
Hi,
On February 26th GitLab CI started failing many jobs because they
could not be scheduled. I've been unable to merge pull requests
because the CI is not working.

Here is an example failed job:
https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

One issue seems to be that the gitlab-cache-pvc PVC is ReadWriteOnce
and Pods scheduled on new nodes therefore cannot start until existing
Pods running on another node complete, causing gitlab-runner timeouts
and failed jobs.
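
For reference, the claim presumably looks something like the following
(a sketch only: the real gitlab-cache-pvc was created outside qemu.git,
so the storage class and size here are guesses):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: gitlab-cache-pvc
  spec:
    accessModes:
      - ReadWriteOnce              # read-write on a single node only
    storageClassName: do-block-storage   # assumed DO default class
    resources:
      requests:
        storage: 50Gi              # assumed size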

When trying to figure out how the Digital Ocean Kubernetes cluster is
configured I noticed that the
digitalocean-runner-manager-gitlab-runner ConfigMap created on
2024-12-03 does not match qemu.git's
scripts/ci/gitlab-kubernetes-runners/values.yaml. Here is the diff
(reproduced in the Patch section at the end of this page):

The cache PVC appears to be a manual addition made to the running
cluster but not committed to qemu.git. I don't understand why the
problems only started surfacing now. Maybe a recent .gitlab-ci.d/
change changed how the timeout behaves or maybe the gitlab-runner
configuration that enables the cache PVC simply wasn't picked up by
the gitlab-runner Pod until February 26th?

In the short term I made a manual edit to the ConfigMap removing
gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
are at least running now, although they may take longer due to the
lack of cache.

In the long term maybe we should deploy minio
(https://github.com/minio/minio) or another Kubernetes S3-like service
so gitlab-runner can properly use a global cache without ReadWriteOnce
limitations?
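
Concretely, that would mean pointing the runner cache at an S3
endpoint instead of the PVC, roughly like this in values.yaml (a
sketch assuming the gitlab-runner Helm chart's runners.config
template; the minio service address, bucket and secret names below
are made up):

  runners:
    cache:
      secretName: minio-credentials   # assumed Secret with accesskey/secretkey
    config: |
      [[runners]]
        [runners.cache]
          Type = "s3"
          Path = "cache"
          Shared = true
          [runners.cache.s3]
            ServerAddress = "minio.minio.svc.cluster.local:9000"  # assumed in-cluster minio
            BucketName = "gitlab-runner-cache"                    # assumed bucket
            Insecure = true

That would let jobs on any node share the cache without the
ReadWriteOnce constraint.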

Since I don't know the details of how the Digital Ocean Kubernetes
cluster was configured for gitlab-runner I don't want to make too many
changes without your input. Please let me know what you think.

Stefan

Comments

Paolo Bonzini March 1, 2025, 6:36 a.m. UTC | #1
On 3/1/25 07:19, Stefan Hajnoczi wrote:
> Hi,
> On February 26th GitLab CI started failing many jobs because they
> could not be scheduled. I've been unable to merge pull requests
> because the CI is not working.
> 
> Here is an example failed job:
> https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

Hi Stefan,

until February 26th the Digital Ocean runners were not enabled; I tried 
enabling them (which is what caused the issue) to start gauging how much 
credit we would need to be able to move from Azure to DO for CI.  I 
posted a note on IRC, I'm sorry if you missed that.

> The cache PVC appears to be a manual addition made to the running
> cluster but not committed to qemu.git. I don't understand why the
> problems only started surfacing now. Maybe a recent .gitlab-ci.d/
> change changed how the timeout behaves or maybe the gitlab-runner
> configuration that enables the cache PVC simply wasn't picked up by
> the gitlab-runner Pod until February 26th?

Almost: the cache is not used on Azure, which is why it works.

> In the short term I made a manual edit to the ConfigMap removing
> gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
> are at least running now, although they may take longer due to the
> lack of cache.

Ok, thanks for debugging that.  I think what you did is right, and the 
caching setup should be tested more on a secondary cluster.

(As to the DO credits numbers, the cost of the k8s cluster is about 
$75/month, and since we were granted $2000 in credits we have only 
$1100/year to spend on the actual jobs.  The plan is to check on the 
credits left at the end of March and bring our estimates to DO's open 
source program manager).
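
(In other words, the cluster alone is about 12 x $75 = $900/year, so
of the $2000 grant roughly $2000 - $900 = $1100/year is left for the
droplets that actually run jobs.)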

Paolo

> In the long term maybe we should deploy minio
> (https://github.com/minio/minio) or another Kubernetes S3-like service
> so gitlab-runner can properly use a global cache without ReadWriteOnce
> limitations?
> 
> Since I don't know the details of how the Digital Ocean Kubernetes
> cluster was configured for gitlab-runner I don't want to make too many
> changes without your input. Please let me know what you think.
> 
> Stefan
> 
>
Stefan Hajnoczi March 1, 2025, 7:27 a.m. UTC | #2
On Sat, Mar 1, 2025 at 2:36 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/1/25 07:19, Stefan Hajnoczi wrote:
> > Hi,
> > On February 26th GitLab CI started failing many jobs because they
> > could not be scheduled. I've been unable to merge pull requests
> > because the CI is not working.
> >
> > Here is an example failed job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/9281757413
>
> Hi Stefan,
>
> until February 26th the Digital Ocean runners were not enabled; I tried
> enabling them (which is what caused the issue) to start gauging how much
> credit we would need to be able to move from Azure to DO for CI.  I
> posted a note on IRC, I'm sorry if you missed that.
>
> > The cache PVC appears to be a manual addition made to the running
> > cluster but not committed to qemu.git. I don't understand why the
> > problems only started surfacing now. Maybe a recent .gitlab-ci.d/
> > change changed how the timeout behaves or maybe the gitlab-runner
> > configuration that enables the cache PVC simply wasn't picked up by
> > the gitlab-runner Pod until February 26th?
>
> Almost: the cache is not used on Azure, which is why it works.
>
> > In the short term I made a manual edit to the ConfigMap removing
> > gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
> > are at least running now, although they may take longer due to the
> > lack of cache.
>
> Ok, thanks for debugging that.  I think what you did is right, and the
> caching setup should be tested more on a secondary cluster.

Glad the change is acceptable and didn't break things more.

> (As to the DO credits numbers, the cost of the k8s cluster is about
> $75/month, and since we were granted $2000 in credits we have only
> $1100/year to spend on the actual jobs.  The plan is to check on the
> credits left at the end of March and bring our estimates to DO's open
> source program manager).

This reminds me I received an email asking for feedback regarding
QEMU's Amazon credits. Just wanted to mention they are there if we
need them.

Stefan
Stefan Hajnoczi March 3, 2025, 7:35 a.m. UTC | #3
On Sat, Mar 1, 2025 at 2:36 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 3/1/25 07:19, Stefan Hajnoczi wrote:
> > Hi,
> > On February 26th GitLab CI started failing many jobs because they
> > could not be scheduled. I've been unable to merge pull requests
> > because the CI is not working.
> >
> > Here is an example failed job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/9281757413
>
> Hi Stefan,
>
> until February 26th the Digital Ocean runners were not enabled; I tried
> enabling them (which is what caused the issue) to start gauging how much
> credit we would need to be able to move from Azure to DO for CI.  I
> posted a note on IRC, I'm sorry if you missed that.

There is a new type of timeout failure:
https://gitlab.com/qemu-project/qemu/-/jobs/9288349332

GitLab says:
"There has been a timeout failure or the job got stuck. Check your
timeout limits or try again"

Duration: 77 minutes 13 seconds
Timeout: 1h (from project)

It ran 17 minutes longer than the job timeout.

Any idea?

Stefan
Paolo Bonzini March 3, 2025, 9:25 a.m. UTC | #4
On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> GitLab says:
> "There has been a timeout failure or the job got stuck. Check your
> timeout limits or try again"
>
> Duration: 77 minutes 13 seconds
> Timeout: 1h (from project)
>
> It ran 17 minutes longer than the job timeout.

The job only seems to have run for roughly 15-20 minutes.

I am not sure what's going on, but I have opened a ticket with DO to
request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
droplets). This matches roughly what was available on Azure.

Let me know if you prefer to go back to Azure for the time being.

Paolo
Stefan Hajnoczi March 3, 2025, 11:01 a.m. UTC | #5
On Mon, Mar 3, 2025 at 5:26 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > GitLab says:
> > "There has been a timeout failure or the job got stuck. Check your
> > timeout limits or try again"
> >
> > Duration: 77 minutes 13 seconds
> > Timeout: 1h (from project)
> >
> > It ran 17 minutes longer than the job timeout.
>
> The job only seems to have run for roughly 15-20 minutes.
>
> I am not sure what's going on, but I have opened a ticket with DO to
> request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
> droplets). This matches roughly what was available on Azure.
>
> Let me know if you prefer to go back to Azure for the time being.

Yes, please. I'm unable to merge pull requests (with a clear
conscience at least) because running CI to completion is taking so
long with many manual retries needed.

Perhaps the timeouts will go away once the droplet size is increased.
It makes sense that running the jobs on different hardware might
require readjusting timeouts.
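
For example, if a job does turn out to need more headroom, the limit
can also be raised per job in .gitlab-ci.d/, along these lines (the
job name and value below are placeholders, not a concrete proposal):

  build-system-ubuntu:
    timeout: 1h 30m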

Thanks,
Stefan
Daniel P. Berrangé March 3, 2025, 1:11 p.m. UTC | #6
On Mon, Mar 03, 2025 at 07:01:16PM +0800, Stefan Hajnoczi wrote:
> On Mon, Mar 3, 2025 at 5:26 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> > On Mon, Mar 3, 2025 at 8:35 AM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > GitLab says:
> > > "There has been a timeout failure or the job got stuck. Check your
> > > timeout limits or try again"
> > >
> > > Duration: 77 minutes 13 seconds
> > > Timeout: 1h (from project)
> > >
> > > It ran 17 minutes longer than the job timeout.
> >
> > The job only seems to have run for roughly 15-20 minutes.
> >
> > I am not sure what's going on, but I have opened a ticket with DO to
> > request both larger droplets (16 vCPU / 32 GB) and a higher limit (25
> > droplets). This matches roughly what was available on Azure.
> >
> > Let me know if you prefer to go back to Azure for the time being.
> 
> Yes, please. I'm unable to merge pull requests (with a clear
> conscience at least) because running CI to completion is taking so
> long with many manual retries needed.
> 
> Perhaps the timeouts will go away once the droplet size is increased.
> It makes sense that running the jobs on different hardware might
> require readjusting timeouts.

It is a bit surprising to see timeouts, as we've fine-tuned our test
timeouts to cope with GitLab's default shared runners, which is what
contributors use when CI runs in a fork. These runners only have
2 vCPUs and 8 GB of RAM, so that's a pretty low resource baseline.

With regards,
Daniel

Patch

--- /tmp/upstream.yaml    2025-03-01 12:47:40.495216401 +0800
+++ /tmp/deployed.yaml    2025-03-01 12:47:38.884216210 +0800
@@ -9,6 +9,7 @@ 
   [runners.kubernetes]
     poll_timeout = 1200
     image = "ubuntu:20.04"
+    privileged = true
     cpu_request = "0.5"
     service_cpu_request = "0.5"
     helper_cpu_request = "0.25"
@@ -18,5 +19,6 @@ 
     name = "docker-certs"
     mount_path = "/certs/client"
     medium = "Memory"
-  [runners.kubernetes.node_selector]
-    agentpool = "jobs"
+  [[runners.kubernetes.volumes.pvc]]
+    name = "gitlab-cache-pvc"
+    mount_path = "/cache"