
any way you can boost performance of building the process_map ? #13

Open
trikiamine23 opened this issue Feb 17, 2020 · 8 comments
@trikiamine23

This tool is great, very great actually.
But is it possible to add multiprocessing to the process_map function?
You can see in the screenshot below that only one processor is used while the nodes and edges data frames are built. With multiprocessing this would be much faster, and we could deploy this package on our servers.
I have nearly 3 million rows (300 unique activities).
[screenshot: CPU monitor showing only one core in use while process_map runs]

@trikiamine23 trikiamine23 changed the title from "any way you can boost performance of building the process_map" to "any way you can boost performance of building the process_map ?" Feb 17, 2020
@fmannhardt
Member

I did some work on improving the performance of the data preparation in the process_map function last year by using data.table instead of dplyr in more places. This should use multiple threads where applicable.
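As a side note, data.table's thread usage can be checked and adjusted; a minimal sketch using data.table's standard API (nothing processmapR-specific):

library(data.table)

getDTthreads(verbose = TRUE)  # report how many threads data.table will use
setDTthreads(0)               # 0 = allow all logical cores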

There are certainly some parts which could be optimised further. To debug your performance problem, we would need some more information on where exactly the bottleneck is. Could you try executing your code with the RStudio profiler activated and upload the saved profvis file? There are some options when using bupaR that can lead to performance degradation.
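A minimal sketch of capturing and saving such a profile (the object name eventlog is a placeholder for your own log):

library(profvis)
library(processmapR)

# Wrap the slow call in profvis() to record where the time is spent
p <- profvis({
  process_map(eventlog)
})

# Save the interactive profile as a file that can be uploaded
htmlwidgets::saveWidget(p, "process_map_profile.html")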

@fmannhardt fmannhardt self-assigned this Feb 17, 2020
@trikiamine23
Author

My data input for the process_map function is a data.frame-based eventlog.
Do I need to convert it to a data.table first?
I have the feeling I am missing something.

PS: I am trying to fix my profvis problem; I will get back with the result as soon as possible.

@fmannhardt
Member

fmannhardt commented Feb 17, 2020

I looked at the profvis log you sent me:

[screenshot: profvis output with most of the time spent in garbage collection]

It looks like R is spending most of the time on garbage collection. That suggests you have too little memory to keep the full data (plus the computation). I will compare it with a normal situation tomorrow, but I think this cannot easily be improved other than by adding more memory.

data.table is currently only used internally; it is not possible to supply a data.table directly.
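One quick way to confirm that garbage collection dominates is base R's gcinfo(); a sketch, with eventlog again standing in for the actual log:

gcinfo(TRUE)           # print a message at every garbage collection
process_map(eventlog)  # watch how often GC fires during the call
gcinfo(FALSE)          # turn the messages off again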

@trikiamine23
Author

trikiamine23 commented Feb 18, 2020

Thank you very much for your response, but this is my memory usage while running the process_map function:
[screenshot: memory usage staying well below the available RAM]
It is some kind of loop that takes an eternity to finish (no result in the end for 3 million rows with 300 unique events).
I know it is not very wise to do so, but sometimes management would like to see the spaghetti shape of the processes.
PS: I did not set the validate option to TRUE (for the conformance check) in the eventlog function.
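For context, a sketch of that construction; the column names are placeholders, and only the validate argument is the option referred to above:

library(bupaR)

el <- eventlog(
  df,                                          # plain data.frame input
  case_id              = "case_id",
  activity_id          = "activity",
  activity_instance_id = "activity_instance",
  lifecycle_id         = "lifecycle",
  timestamp            = "timestamp",
  resource_id          = "resource",
  validate             = FALSE                 # skip the conformance check
)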

@fmannhardt
Member

PS: I did not set the validate option to TRUE (for the conformance check) in the eventlog function

That is probably a good idea since it would take a lot of time to validate the event log.

I see that the available memory should not be the issue. I just realised that I forgot to ask: do you use the current development version (installed from the GitHub master branch) or the CRAN release? There are some improvements in the development version.
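Installing the development version typically looks like this (the repository path is an assumption, not stated in this thread):

install.packages("remotes")
remotes::install_github("bupaverse/processmapR")  # development version from GitHub master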

@trikiamine23
Author

I use the CRAN version; I will test the development version.
Thank you.

@trikiamine23
Author

After taking a closer look, it is actually the SVG export that takes most of the time.
Here is my code:

grf %>%
  generate_dot() %>%
  grViz() %>%
  export_svg() %>%
  svgPanZoom()

grf is a DiagrammeR graph structure.
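Timing each stage separately would confirm which step dominates; a sketch using the same pipeline as above:

library(DiagrammeR)
library(DiagrammeRsvg)
library(svgPanZoom)

system.time(dot    <- generate_dot(grf))   # graph object -> DOT source
system.time(widget <- grViz(dot))          # DOT -> htmlwidget
system.time(svg    <- export_svg(widget))  # htmlwidget -> SVG string (the suspected bottleneck)
system.time(svgPanZoom(svg))               # SVG -> pan/zoom widget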

@noamanemobidata

Hello! Consider replacing dplyr and data.table with duckdb as a potential strategy to enhance the performance of the data preparation. This change could improve efficiency and overall performance, as shown in the recent benchmark: https://duckdblabs.github.io/db-benchmark/
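As an illustration only (this is not processmapR's actual implementation), counting directly-follows edges with duckdb could look like the sketch below; the table and column names are assumptions:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
duckdb_register(con, "eventlog", eventlog_df)  # expose the data.frame without copying it

# Count how often each activity is directly followed by another within a case
edges <- dbGetQuery(con, "
  SELECT from_activity, to_activity, COUNT(*) AS n
  FROM (
    SELECT activity AS from_activity,
           LEAD(activity) OVER (PARTITION BY case_id ORDER BY timestamp) AS to_activity
    FROM eventlog
  ) AS pairs
  WHERE to_activity IS NOT NULL
  GROUP BY from_activity, to_activity
")

dbDisconnect(con, shutdown = TRUE)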
