Nodal projections maxing out on ABL calculation #859
This sounds like what we are seeing for the blade-resolved cases as well. @ashesh2512 @PaulMullowney @marchdf |
There is a problem in the amr-wind boundary conditions. I hope that it's the same problem. |
The nodal projections maxing in the middle of a GPU simulation has always been an issue for the large blade-resolved runs. I have observed the issue for over a year now. |
This is a pretty high priority issue for AWAKEN. They have several runs planned for Summit before the ALCC allocation is up in the next few weeks. If anyone has ideas please jump on this. |
@asalmgren Wanted to get this on your radar. |
@lawrenceccheung -- could you do some additional git bisection to see which git commit breaks things? |
Yes, the latest bisection I did shows that the problem is happening somewhere between bbe0fdd and a75d2ec. I also tried the very latest commit (4b71037), and that also maxes out on the nodal projection. However, the more frustrating thing I've found is that this problem seems to have a random element to it. On a commit that I thought was working (9eb5e61, from Phil's b/awaken-runs branch), I resubmitted the exact same job with the same executable, and something that was working before is now maxing out on nodal_projections. Is there some Summit hardware component to this issue? Commits that were never working seem to be consistently failing, though. Lawrence |
@lawrenceccheung - when you run those specific commits, are they always run with the amrex version in the submodule, or do you sometimes run with an external amrex? If everything about the commits -- including the version of amrex and amrex-hydro -- is the same but it now fails, that does suggest hardware and/or system software.
--
Ann Almgren
Senior Scientist; Dept. Head, Applied Mathematics
Pronouns: she/her/hers
|
@asalmgren -- they're run with the amrex library as a submodule. Everything's been built with spack-manager. Lawrence |
Ok -- I'm assuming AMReX-Hydro is also a submodule. What is your current perspective -- that there is a code change that broke something, or that a change in the system has broken or exposed something?
|
Yes, everything is a submodule. My perspective is that something changed over the last two months -- either in the ExaWind code, or Summit hardware, or both -- which is causing the bottom solver to not converge. I'd like to eliminate the ExaWind code as a possible source of the problem; there are some commits which seem to always fail, so if we can get to a commit that at least works part of the time, we can go from there. |
Ok, so let's go back to the git bisection approach, but maybe the test for "works" vs "fails" needs to be based on multiple runs rather than a single run? Can you identify a single commit where -- on today's hardware and software stack -- things go from "mostly working" to "mostly failing"? When you say it's the bottom solver that is failing, is that hypre or the amrex default BiCG?
|
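One way to mechanize the "multiple runs per commit" test is a verdict function usable from a `git bisect run` script. This is only a sketch under assumptions: `run_case.sh` is a hypothetical script you would write to build/launch the case and exit nonzero whenever the nodal projection hits the iteration cap; the retry count is arbitrary.

```shell
#!/usr/bin/env bash
# Multi-run "good/bad" verdict for git bisect: a single run can pass by luck
# when the failure is intermittent, so require every run to converge.
# run_case.sh is hypothetical -- it should launch the case and exit nonzero
# when the projection maxes out.
bisect_verdict() {
    local ntries=$1; shift
    for _ in $(seq 1 "$ntries"); do
        "$@" || return 1   # any failing run marks the commit bad
    done
    return 0               # every run converged: commit looks good
}

# Example invocation (commented out so the function can be sourced on its own):
# bisect_verdict 3 ./run_case.sh
```

Saved as e.g. `bisect_test.sh` (exiting with the verdict's status), the bisection over the range mentioned above would then be `git bisect start a75d2ec bbe0fdd` followed by `git bisect run ./bisect_test.sh`.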
These cases I'm running are without hypre, so with the amrex defaults. Another data point we just got comes from running a simple, single-level precursor: I'll continue bisecting to see if I can isolate the problem down to a single commit, but obviously we will need to run multiple times to get a sense of whether things are truly working or not. Lawrence |
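For anyone reproducing this, the bottom solver is selectable from the amr-wind input file. A sketch, with option names as I understand the amr-wind MLMG input options -- verify the exact spellings against your version's documentation:

```
# amrex default bottom solver (what these runs use):
nodal_proj.bottom_solver = bicgstab
# hypre alternative, only if amr-wind was built with hypre support:
# nodal_proj.bottom_solver = hypre
```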
These symptoms are identical to what Marc and I debugged last week. |
Lawrence -- are you running the current version with all of Paul's fixes?
|
Specifically, you need 5ae3533 |
Yes, I tried out 4b71037 which includes Paul's fixes. Just to keep everyone up to date, I talked with @PaulMullowney and @psakievich earlier, and I'm going to get some debug information from the nodal projection operation to help diagnose things. We will also try this problem with a CPU-only build on Summit to see if that has different behavior. Lawrence |
More data for those interested in this problem. Here's the verbose output from the nodal projection:
This is run with 9eb5e61, which is based off of the b/awaken-runs branch that Phil put together. Now, what's super interesting is that I've got a production run (same ABL setup, but with OpenFAST turbines) going right now, running simultaneously with this debug case on Summit, and using the same exe, but that one so far has no issues with the bottom solver (knock on wood). I'm not sure how to explain any of this behavior. Lawrence |
I recently re-ran the simple case Lawrence mentioned previously (https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/KingPlains_stable_precursor9.inp) on Summit (with GPUs), but with two added levels of refinement. With the latest amr-wind build, neither the nodal nor MAC projections maxed out. |
@alhs6577 would you comment on your build process? It would be good to see if we can reproduce this. @lawrenceccheung has seen builds that work for a set of runs and then suddenly stop converging so there appears to be an intermittent nature to this issue. |
@psakievich I used one of the latest Summit builds from @lawrenceccheung so he would be the person to ask. |
In case it helps, I put together a log of different runs I was doing to try and bisect the case. I will keep adding to this list with more data points.
|
@asalmgren @PaulMullowney @psakievich it occurred to me that we might have a way to determine whether this is a software or a hardware issue. I have an old executable from commit f92aae1, compiled on April 7, which previously hasn't shown any issues with the nodal projections. We can run a debug test case with this exe many times (say 10 times), and if there aren't any issues with the bottom solver on Summit, then something must have happened to the code itself to cause these changes. Lawrence |
Just to be clear -- when you say "hardware" I'm assuming you're including the system software, e.g. the version of ROCm, etc.?
|
Yes, I'm including system software when I say hardware. That said, I haven't changed the way I compile amr-wind in the last 6 months or so -- the builds should all be using the gcc 10.2.0 toolset (see https://github.com/sandialabs/spack-manager/blob/main/configs/summit/compilers.yaml) -- so hopefully that's not a factor. |
Another interesting data point. I just tried a run using 9537522, which is based off the most recent branch 4b71037 with additional radar scan functionality. That failed, but the bottom solver failed differently than before. Normally when the nodal projections max out, it does so consistently after the first few iterations, like so:
However, with this radar functionality built in, it will fail intermittently:
Not sure what to make of that either, but I'm noting it in case it helps. |
Latest run with CPUs only (July 7, with e97f8472 above) also failed, but please note -- it failed (as in core dumped) on the first MAC projection step, not on the nodal projection step. So something else is going on too. Lawrence |
Hey Lawrence -- can you turn on the verbosity and see where in the MAC it failed? |
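For reference, a sketch of the input-file knobs that would surface where the projections fail. The option names are assumed from the amr-wind MLMG input options and should be verified against the build in use:

```
mac_proj.verbose = 4          # per-iteration MLMG residuals for the MAC projection
mac_proj.bottom_verbose = 4   # bottom-solver iteration output
nodal_proj.verbose = 4
nodal_proj.bottom_verbose = 4
```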
@asalmgren I'm away right now, but I can get more details on where the MAC was failing next week. It's actually something I've seen in multiple cases now, so we might need to open a separate issue for it. Looping in @alhs6577 and @ndevelder here too -- we can put together a series of cases where we've seen MAC issues. |
This issue is stale because it has been open 30 days with no activity. |
@lawrenceccheung Is this still an issue? It went away for me a few months back. It is concerning that I don't know what fixed it. |
@ashesh2512 I will be testing this out again soon. Lawrence |
@lawrenceccheung is going to rerun this case and see if latest updates have fixed this. |
@lawrenceccheung any luck rerunning these and seeing if this got fixed in that amrex update? |
Apologies, I didn't get a chance to repeat the original cases on Summit; however, we have new anecdotal evidence to support this fix. We had a hybrid-solver FSI case (based on this configuration) where the amr-wind solver was maxing out on the linear solver. After switching to the latest amr-wind version with the amrex update, the MLMG solver worked fine. This case was run on Frontier, and we haven't seen any MAC or nodal projection issues since the update. Lawrence |
OK that's good news. I will close this issue. If this comes up again, let's create a new issue. |
I am re-running an ABL case as a part of AWAKEN, and with the latest build of amr-wind (a75d2ec) the nodal_projections are maxing out. This is a case that I've run before, but I'm adding in different sampling planes. You can see the basic configuration here: https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/UnstableABL_farmrun1/UnstableABL_farmrun1_noturbs.inp, and the last time I ran this, both the nodal and MAC projections required only 8 iterations per timestep.
I tried this case with a slightly older build of amr-wind (185c360) from April, and the case is working fine with that exe. So sometime between then and now, something was introduced which affected the ABL solver. I'll continue trying to find which commit is causing the issue, but I'm curious if anybody else is seeing this problem.
Lawrence