Creating a "NodeNotReady" Alert (#2393)
* Renamed the folder to the plural form and added a new node monitor that checks whether any node is stuck in the "NotReady" state.

* Moved the Argo node monitors into their own folder, since these monitors should only be applied in environments with Argo Workflows. Also edited the cron schedule for the node-not-ready CronJob.

* Changed "Application.spec.source.directory.exclude" so that it is not an array.

* "exclude" does not support arrays, so reverted this change, as it is also not necessary.
EliseCastle23 authored Oct 30, 2023
1 parent 3ff37b9 commit e2dc592
Showing 5 changed files with 68 additions and 3 deletions.
@@ -11,7 +11,7 @@ spec:
   source:
     repoURL: https://github.com/uc-cdis/cloud-automation.git
     targetRevision: master
-    path: kube/services/node-monitor
+    path: kube/services/node-monitors/
     directory:
       exclude: "application.yaml"
   syncPolicy:
22 changes: 22 additions & 0 deletions kube/services/node-monitors/argo-monitors/application.yaml
@@ -0,0 +1,22 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: node-monitor-argo-application
+  namespace: argocd
+spec:
+  destination:
+    namespace: default
+    server: https://kubernetes.default.svc
+  project: default
+  source:
+    repoURL: https://github.com/uc-cdis/cloud-automation.git
+    targetRevision: master
+    path: kube/services/node-monitors/argo-monitors/
+    directory:
+      exclude: "application.yaml"
+  syncPolicy:
+    automated:
+      prune: true
+      selfHeal: true
+    syncOptions:
+    - CreateNamespace=true
@@ -1,7 +1,7 @@
 apiVersion: batch/v1
 kind: CronJob
 metadata:
-  name: node-monitor-cron
+  name: argo-node-age
   namespace: default
 spec:
   schedule: "*/5 * * * *"
@@ -55,4 +55,4 @@ spec:
                   curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"WARNING: Node \`${NODE_NAME}\` is older than 3 hours!\"}" $SLACK_WEBHOOK_URL
                 fi
               done
-          restartPolicy: OnFailure
+          restartPolicy: OnFailure
File renamed without changes.
43 changes: 43 additions & 0 deletions kube/services/node-monitors/node-not-ready.yaml
@@ -0,0 +1,43 @@
+apiVersion: batch/v1
+kind: CronJob
+metadata:
+  name: node-not-ready-cron
+  namespace: default
+spec:
+  schedule: "*/30 * * * *"
+  jobTemplate:
+    spec:
+      template:
+        metadata:
+          labels:
+            app: gen3job
+        spec:
+          serviceAccountName: node-monitor
+          containers:
+          - name: kubectl
+            image: quay.io/cdis/awshelper
+            env:
+            - name: SLACK_WEBHOOK_URL
+              valueFrom:
+                configMapKeyRef:
+                  name: global
+                  key: slack_webhook
+
+            command: ["/bin/bash"]
+            args:
+            - "-c"
+            - |
+              #!/bin/sh
+              # Get nodes that show "NodeStatusNeverUpdated"
+              # Names are joined with ", " so the Slack JSON payload contains no raw newlines
+              NODES=$(kubectl get nodes -o json | jq -r '[.items[] | select(.status.conditions[] | select(.type == "Ready" and .status == "Unknown")) | .metadata.name] | join(", ")')
+              if [ -n "$NODES" ]; then
+                echo "Nodes reporting 'NodeStatusNeverUpdated', sending an alert:"
+                echo "$NODES"
+                # Send alert to Slack
+                curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"WARNING: Node \`${NODES}\` is stuck in 'NotReady'!\"}" "$SLACK_WEBHOOK_URL"
+              else
+                echo "No nodes reporting 'NodeStatusNeverUpdated'"
+              fi
+          restartPolicy: OnFailure
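
The node filter in node-not-ready.yaml can be tried without a cluster. The following is a minimal sketch, not part of the commit: the helper name `not_ready_nodes` and the sample node data are hypothetical, and joining the names with ", " and building the Slack payload with `jq -n` are assumptions made here so that the message stays valid JSON when more than one node matches.

```shell
#!/bin/bash
# Hypothetical helper (not in the commit): reads `kubectl get nodes -o json`
# output on stdin and prints the names of nodes whose Ready condition is
# "Unknown", joined with ", ".
not_ready_nodes() {
  jq -r '[.items[]
          | select(.status.conditions[] | select(.type == "Ready" and .status == "Unknown"))
          | .metadata.name] | join(", ")'
}

# Made-up sample standing in for real cluster output.
SAMPLE='{"items":[
  {"metadata":{"name":"node-a"},"status":{"conditions":[{"type":"Ready","status":"True"}]}},
  {"metadata":{"name":"node-b"},"status":{"conditions":[{"type":"Ready","status":"Unknown"}]}},
  {"metadata":{"name":"node-c"},"status":{"conditions":[{"type":"MemoryPressure","status":"Unknown"},{"type":"Ready","status":"Unknown"}]}}
]}'

NODES=$(printf '%s' "$SAMPLE" | not_ready_nodes)
echo "$NODES"

# Let jq build the payload so any special characters in node names are escaped.
PAYLOAD=$(jq -cn --arg nodes "$NODES" '{text: ("WARNING: Node `" + $nodes + "` is stuck in NotReady!")}')
echo "$PAYLOAD"
```

Against a real cluster, the same function would be fed with `kubectl get nodes -o json | not_ready_nodes`, and `$PAYLOAD` would be passed to `curl --data`.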
