Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in in SolverAlgorithm #1037

Open
psakievich opened this issue Sep 9, 2022 · 14 comments
Open

Memory leak in in SolverAlgorithm #1037

psakievich opened this issue Sep 9, 2022 · 14 comments
Assignees

Comments

@psakievich
Copy link
Contributor

We've had a memory leak showing up in the nightly dashboard for a while. https://my.cdash.org/test/61728724

This is the same leak in the unit tests and I've traced it to SolverAlgorithm.C:L60.

   4: Indirect leak of 51200 byte(s) in 80 object(s) allocated from:                                                                                                                                                                                                                                                                                                      
   4:     #0 0xc5993f in malloc /projects/cde/v3/cee/spack/var/spack/stage/spack-stage-llvm-12.0.1-4ld5bciw4oetcntm2hgn23pklymf55ou/spack-src/compiler-rt/lib/asan/asan_malloc_linux.cpp:145:3                                                                                                                                                                            
   4:     #1 0x7f663a2dd482 in Kokkos::HostSpace::impl_allocate(char const*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const (/home/psakiev/soft/spack-manager/spack/opt/spack/linux-rhel7-x86_64/clang-12.0.1/trilinos-stable-xvbe2e6a3e3tbckfyb2r3ap3cz2twdr2/lib/libkokkoscore.so.13+0x3e482)                                                        
   4:     #2 0x2212a6a in sierra::nalu::NGPApplyCoeff::NGPApplyCoeff(sierra::nalu::EquationSystem*) /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithm.C:60:39                                                                                                                                                                                                    
   4:     #3 0x206055e in sierra::nalu::SolverAlgorithm::coeff_applier() /scratch/psakiev/clang-debug/nalu-wind/include/SolverAlgorithm.h:89:42                                                                                                                                                                                                                           
   4:     #4 0x206055e in sierra::nalu::AssembleFaceElemSolverAlgorithm::execute() /scratch/psakiev/clang-debug/nalu-wind/src/AssembleFaceElemSolverAlgorithm.C:93:23                                                                                                                                                                                                     
   4:     #5 0x2216d8b in sierra::nalu::SolverAlgorithmDriver::execute() /scratch/psakiev/clang-debug/nalu-wind/src/SolverAlgorithmDriver.C:123:18                                                                                                                                                                                                                        
   4:     #6 0x215f8fd in sierra::nalu::EquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystem.C:314:21                                                                                                                                                                                                      
   4:     #7 0xd19545 in sierra::nalu::MomentumEquationSystem::assemble_and_solve(stk::mesh::FieldBase*) /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:2674:19                                                                                                                                                                                       
   4:     #8 0xcee0f3 in sierra::nalu::LowMachEquationSystem::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/LowMachEquationSystem.C:743:23                                                                                                                                                                                                                
   4:     #9 0xccd7bb in sierra::nalu::EquationSystems::solve_and_update() /scratch/psakiev/clang-debug/nalu-wind/src/EquationSystems.C:794:12                                                                                                                                                                                                                            
   4:     #10 0xeadc2b in sierra::nalu::Realm::nonlinear_iterations(int) /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1948:47                                                                                                                                                                                                                                       
   4:     #11 0xea9bd8 in sierra::nalu::Realm::advance_time_step() /scratch/psakiev/clang-debug/nalu-wind/src/Realm.C:1933:3                                                                                                                                                                                                                                              
   4:     #12 0xcb7cf2 in sierra::nalu::TimeIntegrator::integrate_realm() /scratch/psakiev/clang-debug/nalu-wind/src/TimeIntegrator.C:392:16                                                                                                                                                                                                                              
   4:     #13 0xc964ec in main /scratch/psakiev/clang-debug/nalu-wind/nalu.C:193:9
@psakievich
Copy link
Contributor Author

@ldh4 this looks like it could be related to work you did in #990?

@ldh4
Copy link
Contributor

ldh4 commented Sep 9, 2022

I can't seem to access the dashboard right now. I am getting:

Network Error (tcp_error)

A communication error occurred: ""
The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time.

Basically, I'd like to know when this leak started to show up.

#987 was the fix for the memory leak revolving around NGPApplyCoeff that led to the segfault, fixed by #990.
I believe I checked that the leak isn't back when I pushed #990, but I can check again.

@psakievich
Copy link
Contributor Author

psakievich commented Sep 9, 2022

@ldh4 it goes back as far as the dashboard stores information. Which admittedly isn't very far. I'm setting up on my mac now to see if I can reproduce there. I was able to reproduce on ascicgpu22 with the spec nalu-wind@master+asan+openfast+hypre build_type=RelWithDebInfo.

image

@psakievich
Copy link
Contributor Author

I do not see this memory leak on my mac with asan turned on. Interesting.

@psakievich
Copy link
Contributor Author

psakievich commented Sep 12, 2022

@ldh4 I've been reviewing this further with @PaulMullowney and I think there are some design changes from #937 that need to be refactored to unify TeptraLinearSystem and HypreLinearSystem's CoeffApplier implementations. We ran the hypre only test for this case and it does not show memory leaks. I suspect the root cause of these leaks are a missing free for the CoeffApplier, but it would make things less confusing if we unified the design a little more.

In particular we think the device_pointer function(s) in TpetraLinearSystem should be restored

  • from
    void free_device_pointer(){};
    sierra::nalu::CoeffApplier* device_pointer() { return nullptr; };
  • to
    void TpetraLinearSystem::TpetraLinSysCoeffApplier::free_device_pointer()
    {
    #ifdef KOKKOS_ENABLE_CUDA
    if (this != devicePointer_) {
    sierra::nalu::kokkos_free_on_device(devicePointer_);
    devicePointer_ = nullptr;
    }
    #endif
    }
    sierra::nalu::CoeffApplier* TpetraLinearSystem::TpetraLinSysCoeffApplier::device_pointer()
    {
    #ifdef KOKKOS_ENABLE_CUDA
    if (devicePointer_ != nullptr) {
    sierra::nalu::kokkos_free_on_device(devicePointer_);
    devicePointer_ = nullptr;
    }
    devicePointer_ = sierra::nalu::create_device_expression(*this);
    #else
    devicePointer_ = this;
    #endif
    return devicePointer_;
    }

and the asymmetry in the NGPApplyCoeff's method for removing memory

Pinging @alanw0 so we can get this on the STK team workload.

@ldh4
Copy link
Contributor

ldh4 commented Sep 12, 2022

I agree that TpetraLinearSystem should be refactored further, specifically regarding CoeffApplier. The main difficulty I noticed while modifying TpetraLinearSystem for #937 was from Tpetra's restriction for variable reference holding, which led us to make this workaround for how CoeffAppliers are created and destroyed. It's something we should take another look at.

@psakievich
Copy link
Contributor Author

@ldh4 yes that sounds good to me. It might be worth reviewing the variable reference issues with @PaulMullowney in a mob programming session so we can get his perspective from the HypreLinearSystem work he's done.

@alanw0
Copy link
Contributor

alanw0 commented Sep 13, 2022

ok we'll get a story on the backlog to take care of this.

@michaelasprague
Copy link
Contributor

@alanw0 -- is there any news on this issue?

@alanw0
Copy link
Contributor

alanw0 commented Jul 8, 2024

No. This may have been lost in the rush to get things working on crusher/frontier...
@ldh4 What was the resolution of this issue?

@ldh4
Copy link
Contributor

ldh4 commented Jul 8, 2024

@alanw0 I don't know we made a strong action to further resolve this.
Based on my memory, we had concluded that the differences in implementation designs between HypreLinearSystem and TpetraLinearSystem are partially contributing to this recurring memory leak/segfault issue.

I recall there being further discussions that HypreLinearSystem is the favored one at the time and that TpetraLinearSystem should be refactored to have similar code flow as HypreLInearSystem, which would have dissolved this recurring issue. But I do not know if that refactoring took a place eventually.

@alanw0
Copy link
Contributor

alanw0 commented Jul 8, 2024

@michaelasprague Dong Hun's comment confirms my suspicion that this memory-leak fix got lower priority when we removed Trilinos solvers from the default build as we were prioritizing the AMD/ROCm work for crusher/frontier.
Shall we close this issue as "won't fix"?

@michaelasprague
Copy link
Contributor

I defer to @psakievich on if it belongs in won't fix

@psakievich
Copy link
Contributor Author

I think backlogged would be a better choice over won't fix. Currently the vast majority of our tests are configured with tpetra solvers and our linear solvers bench is shallower than it used to be.

Unless we plan to totally abandon tpetra solvers I would say a plan should be made to fix this and it would be good to assign to the next person who steps into the role of linear solvers lead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants