
A Span's on_close method is called with a 45 minute delay #2998

Open
Matthias247 opened this issue Jun 5, 2024 · 3 comments

Comments
Matthias247 commented Jun 5, 2024

Bug Report

Version

tracing v0.1.40
tracing-attributes v0.1.27 (proc-macro)
tracing-core v0.1.32
tracing-subscriber v0.3.18
tracing-error v0.2.0 (*)

Platform

Ubuntu 22.04.3 LTS

Description

I'm running an application which creates spans in an async loop. In a simplified fashion it looks like:

use tracing::Instrument;

// sleep_time, do_work, and IterationResult are elided for brevity.
impl Controller {
    pub async fn run(mut self) {
        loop {
            self.run_single_iteration().await;
            tokio::time::sleep(sleep_time).await;
        }
    }

    pub async fn run_single_iteration(&mut self) -> IterationResult {
        let controller_span = tracing::span!(
            tracing::Level::INFO,
            "iteration",
            app_timing_start_time = format!("{:?}", chrono::Utc::now()),
            app_timing_end_time = tracing::field::Empty,
        );

        let res = self
            .do_work()
            .instrument(controller_span.clone())
            .await;

        controller_span.record("app_timing_end_time", format!("{:?}", chrono::Utc::now()));

        res
    }
}

For handling Spans, there is a custom Layer installed. That Layer, among other things, prints the start time of the span as well as the elapsed time when the Span finishes (closes). The intention is to log how long the execution of certain Spans takes.

use tracing::{span, span::Attributes, Subscriber};
use tracing_subscriber::{layer::Context, registry::LookupSpan, Layer};

struct SpanData {
    start_time: chrono::DateTime<chrono::Utc>,
}

impl SpanData {
    fn new() -> Self {
        Self {
            start_time: chrono::Utc::now(),
        }
    }
}

impl<S> Layer<S> for MyFmtLayer<S>
where
    S: Subscriber + for<'span> LookupSpan<'span>,
{
    fn on_new_span(&self, _attrs: &Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        // Store the creation time in the span's extensions so it can be
        // read back when the span closes.
        let data = SpanData::new();
        let span = ctx.span(id).expect("Span not found, this is a bug");
        let mut extensions = span.extensions_mut();
        extensions.insert(data);
    }

    fn on_close(&self, id: span::Id, ctx: Context<'_, S>) {
        let span = ctx.span(&id).expect("Span not found, this is a bug");
        let mut extensions = span.extensions_mut();
        let Some(data) = extensions.remove::<SpanData>() else {
            return;
        };
        drop(extensions);

        let end_time = chrono::Utc::now();
        let elapsed = end_time.signed_duration_since(data.start_time);
        // Also prints all span attributes, but I omitted this for brevity
        println!(
            "timing_start_time={:?} timing_end_time={:?} timing_elapsed_us={}",
            data.start_time,
            end_time,
            elapsed.num_microseconds().unwrap_or_default()
        );
    }
}
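
For reference, the layer is installed on a registry-based subscriber roughly like this (sketch only; MyFmtLayer::default() is a stand-in for however the layer is actually constructed in the application):

use tracing_subscriber::prelude::*;

fn main() {
    // Sketch: MyFmtLayer::default() is a placeholder for the real constructor.
    tracing_subscriber::registry()
        .with(MyFmtLayer::default())
        .init();
}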

With this setup, I got the following types of logs (microseconds are reformatted as seconds here):

timing_start_time=2024-06-04T20:59:39.727760019Z timing_elapsed_s=10 timing_end_time=2024-06-04T20:59:50.467686565Z app_timing_start_time=2024-06-04T20:59:39.727754033Z app_timing_end_time=2024-06-04T20:59:50.467673603Z

timing_start_time=2024-06-04T20:05:13.970649727Z timing_elapsed_s=3241 timing_end_time=2024-06-04T20:59:15.770090934Z app_timing_start_time=2024-06-04T20:05:13.970639435Z app_timing_end_time=2024-06-04T20:05:25.315975078Z

timing_start_time=2024-06-04T20:58:29.879628853Z timing_elapsed_s=10 timing_end_time=2024-06-04T20:58:40.165925709Z app_timing_start_time=2024-06-04T20:58:29.879619323Z app_timing_end_time=2024-06-04T20:58:40.165911190Z

The 2nd log line here is not as expected:

  • In the actual application code (which writes the fields app_timing_start_time and app_timing_end_time) the span starts at 20:05:13 and ends at 20:05:25 (about 11 s execution time).
  • The custom Layer however reports a timing_end_time of 20:59:15 (3241s execution time).
  • In addition to that, the emission of this log line also happens with the same amount of delay. It gets reordered behind the log lines of approximately 100 other iterations.

All in all, it looks like the Span's on_close method is called with a 45 minute delay, while the on_close method for other Spans within this async loop is called immediately, just as expected.

Is there any async/deferred processing in tracing_subscriber that could lead to this?

@mladedav
Contributor

mladedav commented Jun 5, 2024

The only thing I know of that could cause this is if something still holds a reference to the span. That also includes child spans.

I believe something like this can theoretically happen if one of the async tasks spawns a long-running task, which would keep the parent alive. I think this can happen, for example, with long-running connections if they're instrumented as child spans.
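
For illustration, something like this would keep the parent "iteration" span open (just a sketch; the "connection" span and the duration are made up):

use tracing::Instrument;

async fn do_work() {
    // The new span's parent is the current span ("iteration").
    // The registry keeps the parent open as long as a child exists,
    // so on_close for "iteration" is deferred until this task finishes.
    let connection_span = tracing::info_span!("connection");

    tokio::spawn(
        async {
            // Long-running work, e.g. serving a connection.
            tokio::time::sleep(std::time::Duration::from_secs(45 * 60)).await;
        }
        .instrument(connection_span),
    );
}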

Could you try running it again, but collect these timings for all spans and ideally also print the parent spans? If a long-running task is indeed keeping the parent alive, its span would end right before your own, which would confirm it.
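
For example, your on_close could also print the closing span's name together with its ancestors, something like this (rough sketch):

// Inside the same impl<S> Layer<S> for MyFmtLayer<S> block as the snippet above.
fn on_close(&self, id: span::Id, ctx: Context<'_, S>) {
    let span = ctx.span(&id).expect("Span not found, this is a bug");

    // Walk from this span up to the root and collect the ancestor names,
    // so closes can be correlated with their parents in the logs.
    let parents: Vec<String> = span
        .scope()
        .skip(1) // the first element is the span itself
        .map(|ancestor| ancestor.name().to_string())
        .collect();

    println!("closing span={} parents={:?}", span.name(), parents);
}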

@Matthias247
Author

Thanks for the response! Unfortunately, I can't see another span directly closing before that one in my logs. I guess my Layer might not log it since it is suppressed (there are filters like "hyper=error"), even though I am not sure whether these filters prevent child span creation or just suppress the events in those child spans.

Is there a way I could check in other lifecycle methods (like on_exit) which references to a span exist?
I'm sure there's none from the application code, given the Span there really just exists on "the stack", but it seems like some library code might add a reference.

@mladedav
Contributor

mladedav commented Jun 9, 2024

It depends on where you're creating the filters. If they are global, the spans should not be created at all; if the filters are set only on the given layer, then the spans will still exist and the layer just ignores them (see the two setups below).

I don't think you can check references to a given span. You could theoretically iterate all spans to find a given span's children (I think we store only the parent information, not the other way around), but there is no way to tell what else holds a reference to a given span.

Do you have the code publicly available?

use tracing_subscriber::prelude::*;
use tracing_subscriber::{fmt, EnvFilter};

// global filter
tracing_subscriber::registry()
    .with(fmt::layer())
    .with(EnvFilter::from_default_env())
    .init();

// per-layer filter
tracing_subscriber::registry()
    .with(
        fmt::layer()
            .with_filter(EnvFilter::from_default_env())
    )
    .init();
