
Complete a profile of our code #341

Open
Aaron3154518 opened this issue Oct 9, 2019 · 5 comments
@Aaron3154518
Contributor

Profile with Callgrind, then find the expensive operations, report them here, and do some analysis of what you find.

@jaredwhite547
Contributor

jaredwhite547 commented Oct 17, 2019

90.10%  find_peaks
Expected
    ~4M iterations

39.63%  gsl_multifit_nlinear_driver

14.02%  func_df

10.76%  func_f

6.47%   Malloc

5.87%   gsl_rng_alloc
    Not used anywhere
    6% of total find_peaks time

3%  gsl_vector_get
    Has range checking by default
    Consider disabling it (e.g., via the GSL_RANGE_CHECK_OFF macro) in optimized (-O3) builds
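For reference, the GSL manual documents two preprocessor switches relevant here: GSL_RANGE_CHECK_OFF (equivalently GSL_RANGE_CHECK=0) to drop bounds checks, and HAVE_INLINE to inline accessors such as gsl_vector_get. A possible compile line (the flags are from the GSL manual; the source/output file names are illustrative, not the project's actual build targets):

```
# Disable GSL bounds checking and inline vector/matrix accessors in release builds.
# File names below are placeholders for the project's real build targets.
g++ -O3 -DGSL_RANGE_CHECK_OFF -DHAVE_INLINE -c fit_data.cpp -o fit_data.o
```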

spdlog Note:
    Not sure if the current function calls are what we want -- e.g., should we be using the SPDLOG_TRACE macro instead of spdlog::trace?

Recommendations:
    Threading the implementation of fit_data should be fairly trivial. It appears to be a single-producer, multiple-consumer problem, with a few minor details. Rough implementation:
        We have n worker threads. They each contain a queue.
        The main thread adds pulses to each thread's queue one by one.
            EX: Add pulse to thread 0, then thread 1, ..., thread n-1, thread 0, thread 1, etc
        Worker threads dequeue their item, and process them (guess/find peaks, etc). They put their list of peaks into an internal vector.
        Join all threads.
        All worker threads have now been destroyed.
        The main thread takes all the entries from the internal vectors and combines them. Ordering is preserved.
        Done.
    The problem could be slightly modified to support an arbitrarily large number of threads, with a single coalescing stage at the end.
        Each instance of find_peaks is independent, except for this final stage.

    Potential quirks:
        We omit details regarding condition variables and some other small details required for an optimal implementation
        Capping the amount of queued pulses is likely desirable, due to their size.

Aside:
    There are a couple of loops in find_peaks, used only for logging, that run regardless of the actual logging level

Posted as code to preserve formatting.
@cathieO

Callgrind file can be found here. I recommend using KCacheGrind to view it.

I am 95% sure the dataset I used was 140823_152426_2
I am around 75% sure the exact command I ran was:
valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --cache-sim=yes --branch-sim=yes --dump-line=yes ./bin/geotiff-driver -f etc/140823_152426_2.pls -w 13,11,14 -a 4,8,7 -e 1,2,3

@jaredwhite547
Contributor

Re-run once TemplateRefactoring is merged into master.

@jaredwhite547
Contributor

valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes

@nicholasprussen
Contributor

nicholasprussen commented May 7, 2020

New Callgrind output can be found here for TemplateRefactoring

@nicholasprussen
Contributor

New Callgrind output can be found here for Master
