Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance outliers #579

Open
John-Nagle opened this issue Mar 1, 2024 · 5 comments
Open

Performance outliers #579

John-Nagle opened this issue Mar 1, 2024 · 5 comments

Comments

@John-Nagle
Copy link
Contributor

I'm running Tracy on my real Sharpview program now, to see what performance looks like. Here's an overview

redrawtrace2linear

This shows how long the "Redraw" event from Rend3-framework took. Mean time 15ms, which is great. Most of the frame times are close to that. But there are many outliers. Slowest time 110ms, and lots of frames took 50ms-70ms to render.

Basic rendering is probably fast enough. There's concurrent content loading in progress, and the outliers are probably related to that.

I have the whole trace file (25MB) saved, if you want that. Then you can look at it yourself with Tracy.

@John-Nagle
Copy link
Contributor Author

Looking at frame variation.

Frame 5274 took 78ms. No one thing dominates.

  • SwapChain::present took about 10ms, about 10x the usual time.
  • Tonemapping also took about 10ms, mostly in create_bind_group
  • update_gpu_textures also spent about 7ms in create_bind_group.

Frame 5340 took 17ms. Essentially the same content, but no content loading in progress.

Frame 5261 took 16ms. Best frame. Tonemapping only ran 4us.

Looking at device_create_bind_group in wgpu core, it's usually very fast, but sometimes
it is slow. It doesn't do anything with data-dependent size, but it does lock trackers, with
.device.trackers.lock().
Could lock contention with the update threads there be slowing the main thread?

Here's the fastest frame, 16ms:

fastestframe

Here's the slowest frame, 77ms:

slowframe2

@John-Nagle
Copy link
Contributor Author

Tracy performance trace files can be downloaded from https://animats.com/sl/misc/ The one above is vallone2.trace.tracy. Note, 25MB. This can be read with Tracy Profiler 0.10. That gives detailed access to the performance data.

@John-Nagle
Copy link
Contributor Author

Places that look slow in the slow example above:

  • SwapChain::present took 8ms in the slow frame, well under 1ms in the fast frame.
  • ToneMapping took 8ms in the slow frame, well under 1ms in the slow frame.
  • EvaluateInstruction spent 6ms in Update GPU TextureArrays, almost all of that in Device::create_bind_group

Most of the stalls appear to be associated with Device::create_bind_group.

@John-Nagle
Copy link
Contributor Author

Possible performance bottlenecks,with GPU culling removed:

Texture create view - seems that the loader threads can stall the render thread for a while at a texture create.

createviewstallcombined

Mesh create conflict on create buffer?

matlmgradd

None of these alone kill performance but each seems to add 5-10ms to a frame.

@John-Nagle
Copy link
Contributor Author

Looking at

https://github.com/gfx-rs/wgpu/blob/d4f30638b749f269feeb654d7880a7d845a2fff1/wgpu-core/src/device/global.rs#L785

Notes

  • That function acquires three locks, non-nested. One of those locks can be held for too long, but can't tell which one./
  • The "loop" does not seem to iterate; it's just to have an internal block that can be exited from the middle with "break".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant