orfeas-k
(Orfeas Kourkakis)
24 April 2024 13:32
The following Charmed Kubeflow charms provide default alerts to facilitate their monitoring. For more information on alert rules and how they are defined, see the corresponding Prometheus documentation.
The alert tables below use the following columns:
Alert: The name of the alert in the Prometheus dashboard.
Description: The condition under which the alert goes into the Firing state.
Severity: The severity of the alert.
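
Many of the descriptions below say that a condition must hold "for at least the past N minutes" before the alert fires. In Prometheus, an alert whose condition has just become true is pending, and it only moves to firing once the condition has stayed true for the rule's `for` duration. The following is a minimal sketch of that behaviour, not the charms' actual rule definitions. The `AlertTracker` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Hypothetical helper: tracks one alert through inactive -> pending -> firing."""

    for_seconds: float                    # hold duration, like a rule's `for`
    active_since: Optional[float] = None  # time the condition first became true

    def update(self, condition_true: bool, now: float) -> str:
        """Record one rule evaluation at time `now` and return the alert state."""
        if not condition_true:
            # The condition cleared, so the hold timer starts over.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Example: a unit reported as down at every 60-second evaluation,
# checked against a 5-minute hold duration.
tracker = AlertTracker(for_seconds=300)
states = [tracker.update(True, t) for t in range(0, 360, 60)]
```

In this example the alert stays pending for the first five evaluations and only reports firing at the 300-second mark. A single evaluation where the condition is false resets the timer.
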

| Alert | Description | Severity |
| --- | --- | --- |
| ArgoWorkflowErrorLoglines | There are more than 10 new Error log lines every minute for at least the past 4 minutes. | Critical |
| ArgoWorkflowWarningLoglines | There are more than 40 new Warning log lines every minute for at least the past 4 minutes. | Warning |
| ArgoUnitIsUnavailable | The argo-controller unit has been down for the past 5 minutes. | Critical |
| ArgoWorkflowsErroring | At least one more Argo workflow has entered the Error status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsFailed | At least one more Argo workflow has entered the Failed status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsPending | At least one more Argo workflow has entered the Pending status every minute for at least the past 10 minutes. | Warning |

| Alert | Description | Severity |
| --- | --- | --- |
| DexAuthUnitIsUnavailable | The dex-auth unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| EnvoyUnitIsUnavailable | The envoy unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| UnfinishedWorkQueueAlert | The amount of unfinished work in the workqueue has increased significantly during the past 5 minutes. | Critical |
| FileDescriptorsExhausted | The number of open file descriptors has reached 98% of the maximum available. | Critical |
| FileDescriptorsSoonToBeExhausted | The file descriptors are predicted to be exhausted in 1 hour. | High |
| JupyterControllerRuntimeReconciliationErrorsExceedThreshold | The total number of controller runtime reconciliation errors has increased during the past 5 minutes. | Critical |
| JupyterControllerUnitIsUnavailable | The jupyter-controller unit is down for at least the past 5 minutes. | Critical |
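
The two file descriptor alerts above can be pictured as a threshold check and a trend extrapolation. The source does not show the rule expressions, so the sketch below is an assumption about how such checks could work: it fits a least-squares line through recent samples (similar in spirit to the PromQL `predict_linear` function) and compares the value predicted one hour ahead with the limit. The function names, the sample format, and the default parameters are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, open file descriptors)


def predict_linear(samples: Sequence[Sample], seconds_ahead: float) -> Optional[float]:
    """Extrapolate a least-squares line through the samples.

    Returns the value predicted `seconds_ahead` after the last sample,
    or None when there is not enough data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def fds_exhausted(open_fds: float, max_fds: float, ratio: float = 0.98) -> bool:
    """Assumed threshold check: usage has reached `ratio` of the maximum."""
    return open_fds >= ratio * max_fds


def fds_soon_exhausted(samples: Sequence[Sample], max_fds: float,
                       horizon_seconds: float = 3600.0) -> bool:
    """Assumed trend check: the fitted line reaches the maximum at the horizon."""
    predicted = predict_linear(samples, horizon_seconds)
    return predicted is not None and predicted >= max_fds
```

For instance, samples that grow by one descriptor per second would trip `fds_soon_exhausted` against a limit of 1024 well before the 98% threshold is reached, which is the point of having a separate predictive alert.
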

| Alert | Description | Severity |
| --- | --- | --- |
| KatibControllerUnitIsUnavailable | The katib-controller unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| KfpApiUnitIsUnavailable | The kfp-api unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| MetacontrollerUnitIsUnavailable | The metacontroller-operator unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| MinioUnitIsUnavailable | The minio unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| SeldonWorkqueueTooManyRetries | The total number of retries handled by the workqueue has increased during the past 10 minutes. | Critical |
| SeldonHTTPError | The number of HTTP requests with status code 4XX has increased during the past 10 minutes. | Critical |
| SeldonReconcileError | The total number of controller runtime reconciliations that resulted in an error has increased during the past 10 minutes. | Critical |
| SeldonUnfinishedWorkIncrease | The amount of unfinished work in the workqueue has increased during the past 10 minutes. | Critical |
| SeldonWebhookError | The total number of admission HTTP requests with status code 5XX has increased during the past 10 minutes. | Critical |
| SeldonUnitIsUnavailable | The seldon-controller-manager unit is down for at least the past 5 minutes. | Critical |
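
Several alerts above fire when a total "has increased during the past 10 minutes". Totals such as retries or HTTP errors are typically exposed as counters that only go up, except when the process restarts and the counter starts again from zero. The sketch below shows one way to compute the increase over a window while tolerating such resets. It is a simplified illustration under that assumption, not the rule expressions the charms ship: the function names are hypothetical, and unlike the PromQL `increase` function it does not extrapolate to the window boundaries.

```python
from typing import Sequence


def counter_increase(values: Sequence[float]) -> float:
    """Total increase of a counter across consecutive samples in a window.

    A drop between two samples is treated as a counter reset (for example
    after a restart), so the new value is counted from zero.
    """
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset: everything since zero is new
    return total


def has_increased(values: Sequence[float], threshold: float = 0.0) -> bool:
    """True when the counter grew by more than `threshold` over the window."""
    return counter_increase(values) > threshold
```

With samples `[5, 8, 2, 4]`, the drop from 8 to 2 is read as a restart, so the window's increase is 3 + 2 + 2 = 7 instead of a negative difference between the first and last sample.
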

| Alert | Description | Severity |
| --- | --- | --- |
| TrainingOperatorUnitIsUnavailable | The training-operator unit is down for at least the past 5 minutes. | Critical |