Question: What does FluentdRecordsCountHigh alert imply? #89

Open
Ghazgkull opened this issue Aug 13, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@Ghazgkull
Contributor

Is your feature request related to a problem? Please describe.

After enabling the out-of-the-box PrometheusRules provided by this chart, the FluentdRecordsCountHigh alert has briefly fired and resolved itself twice in the first couple of hours.

There is documentation on this alert which reads:

      summary: fluentd records count are critical
      description: In the last 5m, records counts increased 3 times, comparing to the latest 15 min.

Unfortunately, I'm not able to understand what this means. Why is this a "critical" problem?

Describe the solution you'd like

If I'm able to understand what this alert means, I'd like to suggest a documentation change to make it more easily understood.

Ghazgkull added the enhancement label on Aug 13, 2021
@Ghazgkull
Contributor Author

A little follow-up. This alert continued to fire and resolve in our deployments, which were actually operating just fine. We ended up having to write a script to patch the PrometheusRules in our cluster after the Helm chart deploys in order to remove the FluentdRecordsCountHigh alert. Because those alerts live nested inside two arrays in the PrometheusRules object, it's a non-trivial bit of scripting.

I'm left wondering if anyone understands this alert and is getting any value out of it, or if we should just remove it from the chart. @monotek Any thoughts?

@monotek
Member

monotek commented Sep 20, 2021

Why don't you just disable the alert if you don't need it?
https://github.com/kokuwaio/helm-charts/blob/main/charts/fluentd-elasticsearch/values.yaml#L314-L325

@Ghazgkull
Contributor Author

Is there something in the Helm chart configuration that would allow me to disable this one alert? Maybe I'm being dense, but I don't see anything in the code you linked that would let me disable it.

As I mentioned above, I've written a script that modifies the PrometheusRule CRD after it's deployed to remove the alert. But it's complicated, and it just raised the question for me of what the alert is even for.

@monotek
Member

monotek commented Sep 20, 2021

Just overwrite it in your values.yaml:

prometheusRule:
  enabled: true
  prometheusNamespace: monitoring
  rules:
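  # Note: FluentdRecordsCountHigh is intentionally left out of this list, which is what disables it.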
  - alert: FluentdNodeDown
    expr: up{job="{{ include "fluentd-elasticsearch.metricsServiceName" . }}"} == 0
    for: 10m
    labels:
      service: fluentd
      severity: warning
    annotations:
      summary: fluentd cannot be scraped
      description: Prometheus could not scrape {{ "{{ $labels.job }}" }} for more than 10 minutes
  - alert: FluentdNodeDown
    expr: up{job="{{ include "fluentd-elasticsearch.metricsServiceName" . }}"} == 0
    for: 30m
    labels:
      service: fluentd
      severity: critical
    annotations:
      summary: fluentd cannot be scraped
      description: Prometheus could not scrape {{ "{{ $labels.job }}" }} for more than 30 minutes
  - alert: FluentdQueueLength
    expr: rate(fluentd_status_buffer_queue_length[5m]) > 0.3
    for: 1m
    labels:
      service: fluentd
      severity: warning
    annotations:
      summary: fluentd node are failing
      description: In the last 5 minutes, fluentd queues increased 30%. Current value is {{ "{{ $value }}" }}
  - alert: FluentdQueueLength
    expr: rate(fluentd_status_buffer_queue_length[5m]) > 0.5
    for: 1m
    labels:
      service: fluentd
      severity: critical
    annotations:
      summary: fluentd node are critical
      description: In the last 5 minutes, fluentd queues increased 50%. Current value is {{ "{{ $value }}" }}
  - alert: FluentdRetry
    expr: increase(fluentd_status_retry_count[10m]) > 0
    for: 20m
    labels:
      service: fluentd
      severity: warning
    annotations:
      description: Fluentd retry count has been  {{ "{{ $value }}" }} for the last 10 minutes
      summary: Fluentd retry count has been  {{ "{{ $value }}" }} for the last 10 minutes
  - alert: FluentdOutputError
    expr: increase(fluentd_output_status_num_errors[10m]) > 0
    for: 1s
    labels:
      service: fluentd
      severity: warning
    annotations:
      description: Fluentd output error count is {{ "{{ $value }}" }} for the last 10 minutes
      summary: There have been Fluentd output error(s) for the last 10 minutes

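(Helm replaces list values rather than merging them, so defining prometheusRule.rules in your own values.yaml swaps out the chart's default rule list entirely; copy the rules you want to keep and leave out FluentdRecordsCountHigh.)
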
@Ghazgkull
Contributor Author

Ghazgkull commented Sep 21, 2021

@monotek Sure, thanks. I appreciate the help, but I do have a solution in hand which works for me with a different set of tradeoffs.

My request here is for a clarification of the docs, though. My question is: what does "FluentdRecordsCountHigh" actually mean? I just don't understand the description on the alert. Does anyone know what value this alert provides? If so, I'd be happy to update the doc with a PR.

@monotek
Member

monotek commented Sep 21, 2021

The alarm is copied from: https://github.com/fluent/fluent-plugin-prometheus/blob/master/misc/prometheus_alerts.yaml#L49-L59
Imho it just means "hey, you're getting unusually more logs than normal".
But I guess it's best to ask in the repo mentioned above to get clarification.
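
For reference, that upstream rule looks roughly like this (paraphrased; the chart substitutes its own job selector, so check the linked file for the exact expression):

- alert: FluentdRecordsCountHigh
  # compare the record emit rate over the last 5m against 3x the rate over the 15m window ending 15m ago
  expr: sum(rate(fluentd_output_status_emit_records[5m])) BY (instance) > (3 * sum(rate(fluentd_output_status_emit_records[15m] offset 15m)) BY (instance))
  for: 1m
  labels:
    service: fluentd
    severity: critical
  annotations:
    summary: fluentd records count are critical
    description: In the last 5m, records counts increased 3 times, comparing to the latest 15 min.

So it's essentially a "log volume suddenly spiked to more than 3x its recent baseline" alert, not a sign that Fluentd itself is failing.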
