Reporting Histogram Sum Metric #208

Open
shivajividhale opened this issue Feb 7, 2023 · 2 comments

Comments

@shivajividhale

Hey folks,

I'm trying to get some clarification on histogram reporting for Prometheus metrics. A sample histogram metric in Prometheus format looks like this:

# HELP http_request_duration_seconds Api requests response time in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_sum{api="add_product", instance="host1.domain.com"} 8953.332
http_request_duration_seconds_count{api="add_product", instance="host1.domain.com"} 27892
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0"} 0
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.01"} 0
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.025"} 8
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.05"} 1672
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.1"} 8954
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.25"} 14251
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="0.5"} 24101
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="1"} 26351
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="2.5"} 27534
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="5"} 27814
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="10"} 27881
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="25"} 27890
http_request_duration_seconds_bucket{api="add_product", instance="host1.domain.com", le="+Inf"} 27892

In our testing, the <histogram_metric>_sum series is ingested but is not published/reported through the Histogram struct. As a result, we cannot calculate averages.
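For reference, the average in question is just the _sum series divided by the _count series. A minimal Go sketch using the values from the sample above (the function name meanFromHistogram is made up for illustration and is not part of tally):

```go
package main

import "fmt"

// meanFromHistogram returns the average observation given the values of the
// _sum and _count series of a Prometheus histogram. It returns 0 when count
// is 0 to avoid dividing by zero.
func meanFromHistogram(sum float64, count uint64) float64 {
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}

func main() {
	// Values taken from the sample exposition above.
	mean := meanFromHistogram(8953.332, 27892)
	fmt.Printf("mean request duration: %.3f s\n", mean) // mean request duration: 0.321 s
}
```

Without the _sum series there is no numerator for this division, which is why the buckets and the count alone are not enough.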

If I'm not mistaken, this leaves room for a patch that supports the _sum metric (http_request_duration_seconds_sum in the example above) by adding a field such as sum float64 to the Histogram struct.

To avoid introducing breaking changes, could we add another method, func (h *histogram) RecordSum(sum float64), that lets us update the sum on the Histogram struct?
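Roughly, the shape we have in mind is sketched below. This is only an illustration: sumHistogram is a hypothetical stand-in rather than tally's actual histogram type, the Sum accessor is invented for the demo, and the choice to treat the argument as a delta added to a running total (rather than an absolute value) is an assumption that would need to be settled in review.

```go
package main

import (
	"fmt"
	"sync"
)

// sumHistogram is a hypothetical stand-in for a histogram type that also
// tracks a running sum. It is NOT tally's internal struct.
type sumHistogram struct {
	mu  sync.Mutex
	sum float64
}

// RecordSum adds delta to the running sum. Treating the argument as a delta
// is an assumption made for this sketch.
func (h *sumHistogram) RecordSum(delta float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.sum += delta
}

// Sum returns the current running sum (hypothetical accessor).
func (h *sumHistogram) Sum() float64 {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.sum
}

func main() {
	h := &sumHistogram{}
	h.RecordSum(1.5)
	h.RecordSum(2.25)
	fmt.Println(h.Sum()) // 3.75
}
```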

Please let us know whether this request is reasonable, and whether there is something we are missing. If it seems reasonable, my team will be happy to contribute a patch upstream.

Thanks

@shivajividhale
Author

@brawndou any suggestions for this? In addition, I'm trying to find out whether summary metrics are supported. For example, ZooKeeper exposes metrics in Prometheus format such as:

# HELP inflight_diff_count inflight_diff_count
# TYPE inflight_diff_count summary
inflight_diff_count{quantile="0.5",} NaN
inflight_diff_count_count 16.0
inflight_diff_count_sum 28.0

And

# HELP learner_handler_qp_time_ms learner_handler_qp_time_ms
# TYPE learner_handler_qp_time_ms summary
learner_handler_qp_time_ms{key="1",quantile="0.5",} 0.0
learner_handler_qp_time_ms{key="1",quantile="0.9",} 0.0
learner_handler_qp_time_ms{key="1",quantile="0.99",} 0.0
learner_handler_qp_time_ms_count{key="1",} 95.0
learner_handler_qp_time_ms_sum{key="1",} 3.0

Do you know which reporter function I can use to record these summary metrics?

@k24dizzle

I believe this is a limitation with how Tally ingests histogram metrics in the Scope.

tally/stats.go

Line 367 in ecb4ac9

func (h *histogram) RecordValue(value float64) {

When a histogram value is recorded in the scope, the scope will store/cache the value as the upper bound of the bucket it belongs in.

When the scope tells the Prometheus reporter to report these histogram metrics, it flushes out the bucket upper bound value instead of the actual observed value:

func (b cachedHistogramBucket) ReportSamples(value int64) {

As a result, the sum that Prometheus generates is not the exact sum of the observed values but an approximation built from bucket upper bounds.
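To make the effect concrete, here is a small self-contained Go sketch (separate from the gist, and not tally's code) that compares the exact sum of some observations with the sum obtained when each observation is replaced by the upper bound of its bucket. The bucket bounds, the sample values, and the helper names are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketUpperBound returns the smallest bound that is >= v. bounds must be
// sorted ascending and non-empty. Values above the last bound are clamped to
// it, which is a simplification for this sketch only.
func bucketUpperBound(bounds []float64, v float64) float64 {
	i := sort.SearchFloat64s(bounds, v)
	if i == len(bounds) {
		return bounds[len(bounds)-1]
	}
	return bounds[i]
}

// sums returns the exact sum of the observations and the sum obtained when
// each observation is reported as its bucket's upper bound.
func sums(bounds, observed []float64) (exact, reported float64) {
	for _, v := range observed {
		exact += v
		reported += bucketUpperBound(bounds, v)
	}
	return exact, reported
}

func main() {
	bounds := []float64{0.25, 0.5, 1}
	observed := []float64{0.125, 0.375, 0.75}
	exact, reported := sums(bounds, observed)
	fmt.Println(exact, reported) // 1.25 1.75
}
```

The reported sum can only match or overestimate the exact one here, since every value is rounded up to its bucket boundary.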

Here's a gist that shows this behavior:
https://gist.github.com/k24dizzle/54d6c45b95bdeb76bf82632280ee2bfa

As a workaround, you could use a separate counter to keep track of a histogram_sum metric, and possibly overwrite the sum metric that the Prometheus histogramVec emits.
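A rough sketch of the counter half of that idea follows. The counter and histogram interfaces are declared locally and only mirror the method shapes of tally's Counter.Inc(int64) and Histogram.RecordValue(float64); summedHistogram, the scale field, and the fake implementations are hypothetical. Because the counter takes an integer delta, the sketch assumes values are scaled (for example seconds to microseconds) before being added. It does not cover overwriting the sum that the Prometheus reporter emits.

```go
package main

import (
	"fmt"
	"math"
)

// Local interfaces mirroring the method shapes of tally's Counter and
// Histogram; declared here so the sketch is self-contained.
type counter interface{ Inc(delta int64) }
type histogram interface{ RecordValue(value float64) }

// summedHistogram is a hypothetical wrapper that records each value into a
// histogram and also accumulates it, scaled to an integer, into a counter.
type summedHistogram struct {
	hist  histogram
	sum   counter
	scale float64 // e.g. 1e6 to store seconds as microseconds
}

// RecordValue forwards v to the histogram and adds the scaled, rounded value
// to the sum counter.
func (s summedHistogram) RecordValue(v float64) {
	s.hist.RecordValue(v)
	s.sum.Inc(int64(math.Round(v * s.scale)))
}

// Fakes used only to exercise the wrapper in this demo.
type fakeCounter struct{ total int64 }

func (c *fakeCounter) Inc(delta int64) { c.total += delta }

type fakeHistogram struct{ n int }

func (h *fakeHistogram) RecordValue(float64) { h.n++ }

func main() {
	c, h := &fakeCounter{}, &fakeHistogram{}
	s := summedHistogram{hist: h, sum: c, scale: 1e6}
	s.RecordValue(0.125)
	s.RecordValue(0.375)
	fmt.Println(c.total, h.n) // 500000 2
}
```

Dividing the counter by the scale on the query side would recover the sum in the original unit, up to the rounding introduced per observation.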
