Poor performance of amr-wind #1097
Hi, thanks for reaching out! The short answer: strong scaling for a code that spends most of its time in linear solvers (as amr-wind does) can be very difficult in general. However, there are certain things you could do to get the most performance out of your case.
We don't have good guidance for your case because we typically don't spend much time profiling at this scale, and these things vary quite a bit from machine to machine. We do spend a lot of time thinking about code performance on GPUs and at O(10-100k) MPI ranks, and have a better idea of the kinds of numbers that lead to good performance there. After all this, if the code is still not fast enough, then we need to start talking about linear solver input parameters.
@Armin-Ha -- just to follow up -- when you have a chance to re-run with the profiling on, could you send us the output files (maybe just from 1, 4, and 8 cores)? Also, it looks like you do have checkpointing and plotfiles on -- could you turn those off before re-running? And feel free to run fewer steps: if I'm reading your inputs file correctly, you are running over 14000 steps and writing plotfiles/checkpoints roughly 28 times. See what happens if you run 100 steps for each case with all the I/O off? Thanks!
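A minimal sketch of those input changes, using parameter names from the AMR-Wind input reference (verify against your version of the code; the exact step count is just an example):

```
# Short profiling run with all I/O off
time.max_step            = 100    # instead of ~14000 steps
time.plot_interval       = -1     # disable plotfile output
time.checkpoint_interval = -1     # disable checkpoint output
```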
Hi Ann, Thanks for the reply. I conducted the simulations for around 50 steps, so no checkpointing or plotfile output was involved except at the initial time. As you know, AMR-Wind outputs the total time for every single time step, and I have taken the average of these times over 50 steps to exclude the writing time for the initial checkpoint and plot files. I will re-run the cases as you'd like and send you the output files. In addition, I will examine Marc's suggestions to improve the performance.
Sounds great, thanks! The most important thing for me to look at will be the profiling results that are printed at the end of the run.
To clarify: did you run with 512 grids for each run, or use fewer (larger) boxes at lower core counts?
Ann Almgren
Senior Scientist; Dept. Head, Applied Mathematics
Pronouns: she/her/hers
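For context on the "512 grids" question, here is a rough sketch of how a single-level 256^3 domain decomposes into boxes under AMReX's `amr.max_grid_size` (assuming the box size divides the domain evenly; real grid generation also involves `amr.blocking_factor` and load balancing):

```python
# Approximate box count for a cubic single-level domain:
# max_grid_size caps the box edge length, so each dimension
# is split into n_cell // max_grid_size pieces.
def num_boxes(n_cell, max_grid_size):
    per_dim = n_cell // max_grid_size
    return per_dim ** 3

# The 256^3 case in this thread with the old default of 32:
print(num_boxes(256, 32))  # 512 boxes -- hence "512 grids"
# With max_grid_size = 64, fewer and larger boxes:
print(num_boxes(256, 64))  # 64 boxes
```

Fewer, larger boxes mean less ghost-cell exchange per rank but coarser load balancing, which is why the best choice varies by machine and core count.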
I will provide you with the profiling results. Throughout the study, I maintained a fixed mesh of 256x256x256 cells with a fixed domain size of 2560x2560x1280 m³. The only variable I changed between simulations was the number of cores. Best regards,
Hi @Armin-Ha, For comparison, here are some strong scaling results for AMR-Wind that we've observed (the plots show time per timestep, from which the speedups can be calculated). This is a 512 x 512 x 512 ABL case using CPUs and GPUs on the Frontier cluster. The details of the hardware are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#system-overview, but the CPUs are AMD 3rd Gen EPYC processors. Let me know if you have any questions. Cheers, Lawrence
Hi @asalmgren and @lawrenceccheung, Sorry for my late reply, and thanks for sharing the strong scaling results for AMR-Wind, which appear reasonably linear on the AMD 3rd Gen EPYC processors. I would appreciate it if you could provide me with the input file used for this analysis. I have replicated the analysis for the small spinup simulation with a 256x256x256 mesh (10m x 10m x 5m cells) on an Intel Xeon W-2145. The corresponding log files, which include the profiling output, are attached. log_1cores.txt Best regards,
Thanks for the update. @lawrenceccheung, do you have the input file for Armin to try? I am running some local tests on my machine to see if there are better settings for your specific case. I will be out for the next week or so, though.
Hi @Armin-Ha, Yes, you can try running the 512x512x512 case that I used here: https://github.com/lawrenceccheung/ALCC_Frontier_WindFarm/blob/main/precursor/scaling/Baseline_level0/MedWS_LowTI_precursor1.inp. Just set … Lawrence
Minor update. I ran @Armin-Ha's case on a local machine (AMD EPYC-Rome processor) and get the following for strong scaling (scaling plot and notes not reproduced in this transcript):
This, together with the data @lawrenceccheung presented at much higher proc counts, seems to me to indicate that there is not much of a strong-scaling problem on CPUs with amr-wind.
Wanted to share an observation. The results from @lawrenceccheung above show excellent strong scaling down to about 34,000 cells per rank, whereas the results from @marchdf show good strong scaling down to only about 500,000 cells per rank. These are different cases and different machines, but it seems that more performance could be gained in @Armin-Ha's case.
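The cells-per-rank arithmetic behind this comparison can be sketched as follows (the 34k and 500k thresholds are the figures quoted in this thread, not universal limits):

```python
# Cells per MPI rank for a cubic single-level mesh.
def cells_per_rank(n_cell, ranks):
    return n_cell ** 3 // ranks

# The 256^3 spinup case on 8 cores: ~2.1M cells per rank,
# well above both thresholds discussed above.
print(cells_per_rank(256, 8))  # 2097152

# The 512^3 Frontier case scaled well down to ~34k cells/rank,
# i.e. to roughly this many ranks:
print(512 ** 3 // 34_000)      # 3947
```

So at 8 cores the 256^3 case is nowhere near the regime where either dataset shows scaling falling off, which supports the point that more speedup should be available.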
Just to add to the discussion, I have a test case that I have been using to test compilations on Kestrel. My test case has about 50M cells and two refinements. I run it with and without two ALM turbines. Performance starts to drop at ~120k cells per rank, and below that it is really not good. I understand my case and numbers are different from the ones you are all discussing, and most importantly, I'm running on Kestrel, but I just wanted to share what I found. I built my test case from a user's point of view, and it is supposed to mimic a real case I would run, hence the turbines and refinements. Note that I have also not changed amr.max_grid_size from its default. Edit: this is all on CPUs.
As an FYI to all on this thread -- amrex has actually just changed the default max_grid_size for 3D runs on GPUs from 32 to 64 -- some amount of performance testing seems to indicate that's a win. Your mileage may vary, of course.
We actually suggest the same for CPU runs but thought it would be less disruptive for users to do this for GPU-only first and see if there are any gotchas.
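For anyone wanting to try this on an existing case, overriding the default is a one-line input change (parameter name per the AMReX/AMR-Wind input conventions; behavior may differ by version):

```
# New AMReX default for 3D GPU runs; worth trying on CPU runs too
amr.max_grid_size = 64
```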
For this issue, I don't want to discuss the weirdnesses of CPU runs on Kestrel; there are known issues with that machine, and those are being worked on. For the case that opened this issue, we've shown that it can scale on CPUs, and on a machine we trust (Frontier), we get good scaling on a large (similar) case. Given @Armin-Ha's reaction to my post about his case, we can close this issue. Please feel free to reopen, @Armin-Ha, if you need to discuss further.
Hi all,
I have conducted a strong-scaling analysis for a small spinup simulation with a 256x256x256 mesh (10m x 10m x 5m cells) on two different machines with different CPUs. I am concerned about the poor performance observed in this analysis and would appreciate any insights. The corresponding input file and a log file are attached.
log.txt
spinup.txt
Best regards,
Armin