
Poor performance of amr-wind #1097

Closed
Armin-Ha opened this issue Jun 11, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@Armin-Ha

Hi all,

I have conducted a strong-scaling analysis for a small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m resolution) on two different machines with different CPUs. I am concerned about the poor performance observed in this analysis and would appreciate any insights on the matter. The corresponding input file and a log file are attached.

AMR_performance
log.txt
spinup.txt

Best regards,
Armin

@Armin-Ha Armin-Ha added the enhancement (New feature or request) label Jun 11, 2024
@Armin-Ha Armin-Ha changed the title from "Poor performance of amr_wind" to "Poor performance of amr-wind" Jun 11, 2024
@marchdf
Contributor

marchdf commented Jun 11, 2024

Hi, thanks for reaching out! The short answer: strong scaling for a code that spends most of its time in linear solvers (as amr-wind does) can be very difficult in general.

However, there are a few things you could do to get the most performance out of your case (a rough sketch of these settings follows the list):

  • compile with the profiler on (tiny-profile: AMR_WIND_ENABLE_TINY_PROFILE:ON) so you can see where it is spending its time
  • vary the input parameter amr.blocking_factor from 4 to 32 by powers of 2
  • vary the input parameter amr.max_grid_size from 4 to 256 by powers of 2
  • increase the amount of work per core with a bigger cell count
  • use an Intel compiler
  • try threading with OpenMP
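
Here is a minimal sketch of what that looks like in practice. AMR_WIND_ENABLE_TINY_PROFILE is the flag named above; the other CMake flag names are from memory (double-check them against the build docs), and the paths, rank counts, and thread counts are placeholders for your setup:

```
# configure and build with the tiny profiler enabled
cmake -S . -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DAMR_WIND_ENABLE_MPI=ON \
  -DAMR_WIND_ENABLE_TINY_PROFILE=ON
cmake --build build -j 8

# in the inputs file, sweep these one at a time (powers of 2)
#   amr.blocking_factor = 8
#   amr.max_grid_size   = 32

# hybrid MPI+OpenMP run (requires building with AMR_WIND_ENABLE_OPENMP=ON)
export OMP_NUM_THREADS=4
mpirun -np 2 ./build/amr_wind spinup.txt
```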

We don't have good guidance for your case because we typically don't spend much time profiling at this scale, and these things vary quite a bit from machine to machine. We do spend a lot of time thinking about code performance for GPUs and for O(10-100k) MPI ranks, so we have a better sense of the kinds of settings that lead to good performance in that regime.

After all this, if the code is still not fast enough, then we need to start talking about linear solver input parameters.

@asalmgren
Contributor

@Armin-Ha -- just to follow up -- when you have a chance to re-run with profiling on, could you send us the output files (maybe just from 1, 4, and 8 cores)? Also, it looks like you do have checkpointing and plotfiles on -- could you turn those off before re-running? And feel free to run fewer steps -- if I'm reading your inputs file correctly, you are running over 14000 steps and writing plotfiles/checkpoints roughly 28 times. See what happens if you run, say, 100 steps for each case with all the I/O off. Thanks
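
In input-file terms, that would look something like the sketch below. The parameter names are the usual amr-wind time controls as I remember them, so please check them against your own spinup.txt:

```
time.max_step            = 100   # ~100 steps is plenty for timing
time.plot_interval       = -1    # a negative interval should disable plotfile output
time.checkpoint_interval = -1    # a negative interval should disable checkpoints
```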

@Armin-Ha
Author

Armin-Ha commented Jun 13, 2024

Hi Ann, thanks for the reply. I conducted the simulations for around 50 steps, so no checkpointing or plotfile writing was involved except at the initial time. As you know, AMR-Wind outputs the total time for every time step, and I took the average of these times over the 50 steps to exclude the time spent writing the initial checkpoint and plot files. I will re-run the cases as you suggested and send you the output files. In addition, I will look into Marc's suggestions for improving the performance.
Best regards,
Armin

@asalmgren
Contributor

asalmgren commented Jun 13, 2024 via email

@Armin-Ha
Author

I will provide you with the profiling results. Throughout the study, I kept the mesh fixed at 256x256x256 cells and the domain size fixed at 2560 x 2560 x 1280 m^3. The only variable I changed between simulations was the number of cores.

Best regards,
Armin

@lawrenceccheung
Contributor

Hi @Armin-Ha,

For comparison, here are some strong-scaling results for AMR-Wind that we've observed (the plots show time per timestep, which can be converted to speedup). This is a 512 x 512 x 512 ABL case run on CPUs and GPUs of the Frontier cluster.
[Figure: strong-scaling results (time per timestep) for the 512^3 ABL case on Frontier CPUs and GPUs]

The details of the hardware are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#system-overview, but the CPUs are AMD 3rd Gen EPYC processors. Let me know if you have any questions.

Cheers,

Lawrence

@Armin-Ha
Author

Armin-Ha commented Jun 25, 2024

Hi @asalmgren and @lawrenceccheung,

Sorry for my late reply, and thanks for sharing the strong-scaling results for AMR-Wind, which appear to be reasonably linear on the AMD 3rd Gen EPYC processors. I would appreciate it if you could provide me with the input file used for this analysis.

I have replicated the analysis for the small spinup simulation with a 256x256x256 mesh (10 m x 10 m x 5 m resolution) on an Intel Xeon W-2145. The corresponding log files, which include the profiling output, are attached.

log_1cores.txt
log_2cores.txt
log_4cores.txt
log_8cores.txt

Best regards,
Armin

@marchdf
Contributor

marchdf commented Jun 27, 2024

Thanks for the update. @lawrenceccheung do you have the input file for Armin to try?

I am running some local tests on my machine to see if there are better settings for your specific case. I will be out for the next week or so though.

@lawrenceccheung
Contributor

Hi @Armin-Ha,

Yes, you can try running the 512x512x512 case that I used here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/scaling/Baseline_level0/MedWS_LowTI_precursor1.inp. Just set time.max_step or time.stop_time to something small so you only run a few iterations for timing purposes.
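
If your build accepts AMReX-style command-line overrides of input parameters (most AMReX-based codes do, but treat that as an assumption and verify on yours), you can cap the step count without editing the downloaded file; the rank count and executable path here are just placeholders:

```
# run 10 steps of the precursor case, for timing only
mpirun -np 8 ./amr_wind MedWS_LowTI_precursor1.inp time.max_step=10
```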

Lawrence

@marchdf
Contributor

marchdf commented Jul 9, 2024

Minor update. I ran @Armin-Ha's case on a local machine (AMD EPYC Rome processor) and get the following strong-scaling results:

[Figure: strong-scaling comparison of @Armin-Ha's data against local runs with amr.max_grid_size = 16, 32, 64]

Notes:

  • @Armin-Ha's data is in red.
  • I played with just one of the many tuning parameters (amr.max_grid_size = 16, 32, 64).
  • Scaling is good down to approximately 5e5 cells per rank.
  • Increasing amr.max_grid_size mostly improves runtime, but it gets worse at low cell counts per rank because the larger boxes cannot be distributed as evenly across all the ranks (see the box-count sketch after this list).
  • @Armin-Ha's scaling is poor compared to these data. Bad MPI implementation? Bad procs? Not sure what is going on with that system.
  • I would imagine playing with other parameters could improve the scaling at low cells/core, maybe even playing with OMP.
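
To put numbers on the max_grid_size trade-off: on a single-level 256^3 grid the domain is chopped into boxes no larger than amr.max_grid_size (the usual AMReX behavior, leaving aside any further chopping the load balancer may do), so the box count, and with it how finely work can be balanced across ranks, changes quickly:

```
# amr.max_grid_size = 64  ->  (256/64)^3 =   64 boxes
# amr.max_grid_size = 32  ->  (256/32)^3 =  512 boxes
# amr.max_grid_size = 16  ->  (256/16)^3 = 4096 boxes
```

Fewer, larger boxes generally mean less ghost-cell exchange but coarser load balancing, which matches the behavior noted above.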

This, together with the data @lawrenceccheung presented at a much higher proc count, seems to me to indicate that there is not much of a strong-scaling problem on CPUs with amr-wind.

@michaelasprague

Wanted to share an observation. The results from @lawrenceccheung above show excellent strong scaling down to about 34,000 cells per rank, whereas the results from @marchdf show good strong scaling down to only about 500,000 cells per rank. These are different cases and different machines, but it seems that more performance could be gained in @Armin-Ha's case.

@rthedin

rthedin commented Jul 10, 2024

Just to add to the discussion, I have a test case that I have been using to check builds on Kestrel. It has about 50M cells and two refinement levels, and I run it with and without two ALM turbines. Performance starts to drop at ~120k cells per rank, and below that it is really not good. I understand my case and numbers are different from the ones you are all discussing, and, most importantly, I'm running on Kestrel, but I just wanted to share what I found. I built the test case from a user's point of view, and it is supposed to mimic a real case I would run, hence the turbines and refinements. Note that I have also not changed amr.max_grid_size from its default.

[Figure: strong-scaling results for the ~50M-cell Kestrel test case]

Edit: this is all on CPUs.

@asalmgren
Contributor

asalmgren commented Jul 10, 2024 via email

@marchdf
Contributor

marchdf commented Jul 12, 2024

For this issue, I don't want to get into the weirdness of CPU runs on Kestrel; there are known issues with that machine and those are being worked on. For the case that opened this issue, we've shown that it can scale on CPUs. And on a machine we trust (Frontier), we get good scaling on a large, similar case.

Based on @Armin-Ha's reaction to my post about his case, we can close this issue. Please feel free to reopen, @Armin-Ha, if you need to discuss further.

@marchdf marchdf closed this as completed Jul 12, 2024