This repository has been archived by the owner on Sep 2, 2021. It is now read-only.

Experiencing some random hangs under heavy workload #47

Open
ltsdw opened this issue Aug 17, 2021 · 67 comments

@ltsdw

ltsdw commented Aug 17, 2021

I've been experiencing hangs (everything freezes for about 5 seconds) when playing some games under Wine that use a lot of CPU, and sometimes when watching videos.

To make sure it was the CacULE patch and nothing else, I tested with the mainline Arch kernel (no hangs). Since I have some other patches applied to my kernel, I also tried compiling it without the CacULE patch (also no hangs). Then I applied CacULE again and the hangs came back.

I'm not quite sure, but I think the commit that introduced it is 06cb3974.

I didn't try reverting the commit to test; I only tested with these:

cacule-patch-with-hangs.txt - patch where the hangs happen

cacule-without-hangs.txt - patch without the hangs

If needed, I can try bisecting later to see exactly which commit causes it.

@ltsdw
Author

ltsdw commented Aug 17, 2021

Also, all the tunable settings are at their defaults.

@raykzhao

Hi @ltsdw

Based on #43, have you tried reducing kernel.sched_cache_factor to a lower value, e.g. 0? Also, from my experience, you may want to set kernel.sched_cacule_yield to 0, since it may cause freezes due to some I/O issues; see #35.
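
For anyone following along, both tunables can be inspected and changed at runtime, without a rebuild. A minimal sketch, assuming a CacULE-patched kernel that exposes them under /proc/sys/kernel; the helper name show_tunables and its directory argument are made up for illustration and are not part of CacULE:

```shell
#!/bin/sh
# Print the current value of each named scheduler tunable.
# $1: directory holding the tunables (normally /proc/sys/kernel)
# remaining args: tunable names, e.g. sched_cache_factor
# (show_tunables is a hypothetical helper, not a CacULE tool.)
show_tunables() {
    [ "$#" -gt 0 ] || { echo "usage: show_tunables DIR NAME..." >&2; return 2; }
    dir=$1
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Not a CacULE kernel, or this version lacks the knob.
            printf '%s = (not available)\n' "$name"
        fi
    done
}

# Typical use on a CacULE kernel (changing values needs root):
#   show_tunables /proc/sys/kernel sched_cache_factor sched_cacule_yield
#   sudo sysctl -w kernel.sched_cache_factor=0
#   sudo sysctl -w kernel.sched_cacule_yield=0
```

Values set with sysctl -w last only until reboot, which is convenient for A/B testing a freeze.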

@ltsdw
Author

ltsdw commented Aug 17, 2021

Hi there @raykzhao

Thank you for your suggestion, I'll try it.

@ltsdw
Author

ltsdw commented Aug 17, 2021

Sadly, it didn't work. I tried:
kernel.sched_cache_factor=0
kernel.sched_cacule_yield=0

but the hangs are still there.

@hamadmarri
Owner

hamadmarri commented Aug 17, 2021

> kernel.sched_cache_factor=0

Could you please also set
kernel.sched_starve_factor=0

Is RDB enabled?

@ltsdw
Author

ltsdw commented Aug 17, 2021

> Could you please also set
> kernel.sched_starve_factor=0

The hangs are still there with kernel.sched_starve_factor=0.

> Is RDB enabled?

I think it's enabled by default with the patch, so I believe so.

@hamadmarri
Owner

Could you please try without RDB?

@ltsdw
Author

ltsdw commented Aug 17, 2021

> Could you please try without RDB?

I don't think there is a way to disable it at runtime, so it's necessary to recompile, right?

@hamadmarri
Owner

hamadmarri commented Aug 17, 2021

Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?

Also provide all technical information and versions like kernel, wine, which game, and what settings.

Thanks

@ltsdw
Author

ltsdw commented Aug 17, 2021

Sure, this one is from my last compile on 5.13.8: config.txt.

CPU: i5 5200U
GPU: Intel(R) HD Graphics 5500 (using iris)
RAM: 8 GB
Mesa: 21.3.0 (commit c0fc745b78b)
Wine: 6.13 (with some patches from tkg)
Games that I tested with: NovaRO, GTA San Andreas, Path of Exile (for this one I'd blame my GPU more than anything else). It also happens out of nowhere when watching videos, or when I'm compiling something.

When you say settings, which ones do you mean? CacULE's? If so, they are all at the defaults.

Now let me recompile; it will take some time.

@JohnyPeaN

I have such lags in RDR2 (only), and setting kernel.sched_interactivity_factor=50 seems to help. It doesn't happen without RDB, but without RDB background load has stronger negative effects. I will test kernel.sched_starve_factor=0, too.

@ltsdw
Author

ltsdw commented Aug 17, 2021

Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.

Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?

@hamadmarri
Owner

hamadmarri commented Aug 17, 2021

Hi @ltsdw ,

Good to hear it's working fine now, however, I really would like to troubleshoot why RDB causes these freezes.

Regarding tuning, there is no specific way to test. I tried to make the defaults work well in general, but if you have any issue you can change them. You need some background in CPU scheduling so you can read about each CacULE sysctl and change them accordingly.

I would like to keep this issue open until we see why RDB performs badly with Wine.

Thank you

@hamadmarri
Owner

I suspect it is related to RCU calls and softirq. I will post some fixes to try soon.

Thank you

@JohnyPeaN

@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.
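
The comment does not say how the ~160k figure was measured. As a rough sketch of one way to get a comparable number, the system-wide context-switch counter is the ctxt line in /proc/stat, and sampling it twice gives a rate. The function names below are illustrative only:

```shell
#!/bin/sh
# Extract the cumulative context-switch counter from a /proc/stat style file.
# $1: file to read (defaults to /proc/stat)
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Report system-wide context switches per second.
# $1: sampling window in whole seconds (defaults to 1)
ctxt_rate() {
    secs=${1:-1}
    before=$(ctxt_total)
    sleep "$secs"
    after=$(ctxt_total)
    echo $(( (after - before) / secs ))
}
```

Running ctxt_rate 5 while the game is in a scene that freezes, and again at the desktop, gives a before/after comparison. For a single process, the voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines in /proc/<pid>/status give per-task counters.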

@hamadmarri
Owner

hamadmarri commented Aug 18, 2021

Hi @JohnyPeaN , @ltsdw

To narrow down the troubleshooting, could you please try RDB with:
CONFIG_HZ_PERIODIC=y
to see if it is actually related to no_hz_{idle, full} balancing?
I remember I had nohz_balancer_kick(rq); in RDB before, but for some reason that I have forgotten, I removed it from RDB's trigger_load_balance function.

Also, can you try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Or try vice versa: in case most of your RCU configs are disabled, try enabling them.

Based on an RDB code review I did two minutes ago, I suspect it is because of nohz balancing. I am assuming that you are using no_hz_full?

Please let me know if any of the above changes fix the freezes so I can propose a fix based on your feedback. If none of the above configs has any positive effect, then I can investigate something else.

Thank you

@JohnyPeaN

@hamadmarri I needed to set PREEMPT=n too, otherwise the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

@hamadmarri
Owner

Hi @JohnyPeaN

I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.

Thank you

@ltsdw
Author

ltsdw commented Aug 18, 2021

@hamadmarri

ok, I'll try too, but I'll need some time, thank you!

@ltsdw
Author

ltsdw commented Aug 18, 2021

@hamadmarri

while compiling I noticed this:

kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance' [-Werror,-Wimplicit-function-declaration]
                nohz_newidle_balance(this_rq);
                ^
kernel/sched/fair.c:11324:3: note: did you mean 'nohz_run_idle_balance'?
kernel/sched/sched.h:2439:20: note: 'nohz_run_idle_balance' declared here
static inline void nohz_run_idle_balance(int cpu) { }
                   ^
1 error generated.
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

and the build failed.

@ltsdw
Author

ltsdw commented Aug 18, 2021

> needed to set PREEMPT=n too, otherwise the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Oh, now I see it.

> I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.

@hamadmarri

But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without it!

@hamadmarri
Owner

Hi @ltsdw

Please try first with CONFIG_HZ_PERIODIC=y only. Keep the rest as it was.

Thank you

@ltsdw
Author

ltsdw commented Aug 18, 2021

@hamadmarri

But now there is a compile error: kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance'

@raykzhao

Hi @ltsdw @hamadmarri

I think the compile error occurs because nohz_newidle_balance is not defined when CONFIG_NO_HZ_COMMON=n and CONFIG_CACULE_RDB=y. Please try the following fix:

--- a/kernel/sched/fair.c	2021-08-18 22:39:26.513174343 +1000
+++ b/kernel/sched/fair.c	2021-08-18 22:38:19.322803092 +1000
@@ -11084,9 +11084,9 @@
 {
 	return false;
 }
+#endif
 
 static inline void nohz_newidle_balance(struct rq *this_rq) { }
-#endif
 
 #endif /* CONFIG_NO_HZ_COMMON */
 

fix.patch.zip

@ltsdw
Author

ltsdw commented Aug 18, 2021

@hamadmarri @raykzhao

OK, I tested with CONFIG_HZ_PERIODIC=y and, at least for me, the hangs are still there.
Now I'll try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Just a question: should I still use CONFIG_HZ_PERIODIC=y or not?

@JohnyPeaN

@hamadmarri CONFIG_HZ_PERIODIC=y removes the random lags and the game is smooth even with RDB. I also tried the other suggested config options, but nothing noticeable happened.

@ltsdw
Author

ltsdw commented Aug 18, 2021

@hamadmarri @JohnyPeaN

Just tested with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

and it also didn't work; the hangs are still happening for me.

@raykzhao

Hi @ltsdw

Another thing I would suspect is autogroup. Have you tried disabling it? You may try adding noautogroup to your kernel boot command-line parameters.

@ptr1337

ptr1337 commented Aug 19, 2021

Even when running games with futex2 I don't run into any issues.

I'm going to test again, but I'm sure there is no problem.

@hamadmarri
Owner

hamadmarri commented Aug 20, 2021

> Hi @hamadmarri,
>
> Since the majority of the issues reported here happen during Wine/gaming, maybe it is a good idea to look at the locking. I suspect there may be some issues in the latest RDB/autogroup with futex2. Also, some game developers are known to use locking mechanisms in ways they are not supposed to be used.

Hi @raykzhao

I am not sure, actually, because most of the feedback is not strongly related. None of the fixes worked for @ltsdw, but for @JohnyPeaN changing to periodic HZ worked. Also, I have tested my proposed fix and it reduces performance to below that of the CFS balancer. I guess the best way is to make RDB work with periodic HZ and without {auto, fair}_group. The locking issues in games could be a reason, but why is it OK with the CFS balancer and bad with RDB? I thought it was because the CFS balancer goes through softirq, but even the fix where I made the RDB balancer use softirq didn't fix the freezes. I am afraid it is due to something else that RDB didn't take care of.

If you don't mind, @ALL, could you please attach your CPU topology from lstopo? It could be related to shared-core balancing, or to the number of CPUs, where heavy locking becomes an issue.

Thank you
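
Since the replies below attach lstopo output as screenshots, a plain-text view of the same core and SMT-sibling layout may be easier to compare. A minimal sketch that reads the standard sysfs topology files; the function name and its optional base-directory argument are assumptions for illustration:

```shell
#!/bin/sh
# List each CPU with its package, core and SMT sibling set.
# $1: sysfs CPU directory (defaults to /sys/devices/system/cpu)
cpu_topology() {
    base=${1:-/sys/devices/system/cpu}
    for t in "$base"/cpu[0-9]*/topology; do
        [ -d "$t" ] || continue
        cpu=${t%/topology}   # strip the trailing /topology
        cpu=${cpu##*/}       # keep only the cpuN component
        printf '%s package=%s core=%s siblings=%s\n' "$cpu" \
            "$(cat "$t/physical_package_id")" \
            "$(cat "$t/core_id")" \
            "$(cat "$t/thread_siblings_list")"
    done
}
```

Calling cpu_topology with no arguments prints one line per logical CPU; CPUs that share a siblings value are hardware threads of the same core. The lines come out in glob order, so cpu10 sorts before cpu2.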

@ltsdw
Author

ltsdw commented Aug 20, 2021

@hamadmarri

I don't know if I was supposed to post an image or something else, but here it is:

Screenshot-20-08-2021_09-45-25

Thank you for your support!

@raykzhao

Hi @hamadmarri,

This is my laptop:
Screenshot_2021-08-21_01-04-06

@JohnyPeaN

@hamadmarri
this is the machine on which I'm testing:

Screenshot_lstopo_2

@MoisesMH

Hey @hamadmarri

This is the machine I'm testing:
AMD Ryzen 5 3600 6-core processor
2x8GB DDR4 2666 RAM
256GB NVMe M.2 SSD
2TB HDD Drive
4GB GDDR6 VRAM RX 5500 XT

lstopo

@JohnyPeaN

@hamadmarri
I have made a discovery. The lagging is caused by the compositor, not by the game engine (I noticed that MangoHud was showing a constant 60 fps). So if I disable the Plasma compositor, the game is fluid even with RDB.
With the compositor enabled:

cacule = no lags
cacule + rdb = heavy lags
cacule + rdb + fix = very short, but frequent and noticeable lags
cacule + rdb + periodic = no lags

So it seems that the compositor gets neglected under certain circumstances, and although the game renders its frames, they are not shown.

Here is the top of a perf session:

41.15%  swapper          [kernel.vmlinux]                      [k] acpi_idle_enter
10.13%  swapper          [kernel.vmlinux]                      [k] acpi_processor_ffh_cstate_enter
 1.42%  RDR2.exe         ntdll.so                              [.] __fsync_wait_objects
 1.03%  RDR2.exe         ntdll.so                              [.] __wine_syscall_dispatcher
 1.02%  RDR2.exe         [kernel.vmlinux]                      [k] native_sched_clock

@hamadmarri
Owner

hamadmarri commented Aug 20, 2021

Hi @JohnyPeaN

I think it is related to the tick update, where RDB-r3 needs to update the highest-IS task on every tick. The previous RDB version used a slightly different approach, since enqueue was sorted.

Do the lags happen on the previous RDB version (the one with no sched_group support)?

Thank you for the observation 👍

@hamadmarri
Owner

hamadmarri commented Aug 20, 2021

Hi @MoisesMH

Just to double-check, could you please try with CONFIG_HZ_PERIODIC=y and without the fix patch? I recommend using make menuconfig to enable CONFIG_HZ_PERIODIC, since it sets the corresponding configs automatically, so you don't need to worry about the other CONFIG_NO_HZ_* settings.

What I am thinking is that you and @JohnyPeaN have many CPUs, so there is a high probability that some of them go idle and the no_hz wakeup doesn't work with RDB. Also, I am afraid that @ltsdw needs to retry with CONFIG_HZ_PERIODIC=y, make sure there are no compilation errors, and check that CONFIG_HZ_PERIODIC=y is enabled after installation.
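
One way to do that post-install check is to read the options back from the running kernel's config. A minimal sketch, assuming the kernel exposes /proc/config.gz (which needs CONFIG_IKCONFIG_PROC) or that a plain-text config is available elsewhere; config_lookup is a made-up helper name:

```shell
#!/bin/sh
# Print the lines for the given option names from a kernel config file.
# $1: config file, plain text or gzip-compressed (e.g. /proc/config.gz)
# remaining args: option names without the CONFIG_ prefix
config_lookup() {
    [ "$#" -gt 0 ] || { echo "usage: config_lookup FILE OPTION..." >&2; return 2; }
    file=$1
    shift
    case $file in
        *.gz) reader="gzip -dc" ;;
        *)    reader="cat" ;;
    esac
    for opt in "$@"; do
        # Matches both "CONFIG_X=y" and "# CONFIG_X is not set".
        $reader "$file" | grep -E "^(# )?CONFIG_${opt}(=| is not set)" \
            || echo "CONFIG_${opt}: not present"
    done
}

# Example on the running kernel:
#   config_lookup /proc/config.gz HZ_PERIODIC NO_HZ_IDLE NO_HZ_FULL CACULE_RDB
```

If the first option does not come back as CONFIG_HZ_PERIODIC=y, the booted kernel is not the one that was just built, or the option was reset during configuration.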

Another suspicion is that the RDB-r3 balancer tries to pick from all tasks in the rq, some of which use an RT policy! In contrast, the previous RDB version used only rq->cfs tasks to balance. So it could be that the Plasma compositor runs under an RT policy (not sure); if that is the case, then RDB keeps balancing the RT task (because it moves one task at a time) and CFS tasks are not balanced at all (during the freezes). Could @JohnyPeaN please check which policy the Plasma compositor uses?
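
A quick way to answer the policy question is to read it from procfs. The sketch below decodes the policy field of a /proc/<pid>/stat line (field 41 according to proc(5)); the function name is illustrative, and kwin_x11 as the compositor process is only a guess:

```shell
#!/bin/sh
# Read one /proc/<pid>/stat line on stdin and print the scheduling policy name.
# The comm field (2) may contain spaces and parentheses, so everything up to
# the last ") " is dropped first; field 41 then becomes field 39.
policy_from_stat() {
    sed 's/^.*) //' | awk '{
        n[0] = "SCHED_OTHER"; n[1] = "SCHED_FIFO"; n[2] = "SCHED_RR"
        n[3] = "SCHED_BATCH"; n[5] = "SCHED_IDLE"; n[6] = "SCHED_DEADLINE"
        p = $39
        if (p in n) print n[p]; else print "unknown(" p ")"
    }'
}

# Example (assumes the compositor process is named kwin_x11 and runs once):
#   policy_from_stat < "/proc/$(pidof kwin_x11)/stat"
```

SCHED_FIFO or SCHED_RR would mean the compositor runs under an RT policy; SCHED_OTHER is the normal class. With util-linux installed, chrt -p <pid> reports the same information.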

I am 100% sure that RDB is not considering the nohz kick to wake up idle CPUs, and if setting the periodic tick works for all of you, then we know that it is about the nohz wakeup kicker. However, if @ltsdw still has freezes while using the periodic tick, then we might have another issue as well.

Please make sure that you set:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

during testing, so that we know the freezes are not related to the cache or starve scores.

Thank you

@MoisesMH

@ltsdw mentioned he used rdb without autogroup and it gave him no spikes.

At the moment, I've compiled the kernel I've used without the patch and these parameters:

RDB Interval: 19 (default).
CONFIG_HZ_1000=y
CONFIG_SCHED_AUTOGROUP=n
CONFIG_NO_HZ=y (I've read it's used for old configs, so I kept it enabled)
CONFIG_NO_HZ_IDLE=y (Tickless idle)

Also I've tweaked some options for the kernel configuration. I'll post it here just in case you want to take a look:

https://drive.google.com/file/d/1eR6NIPe88lc1SCz_nqGjNXPPRPOSavv0/view?usp=sharing

For me it's weird: 15 minutes ago I was testing about 30 minutes of gameplay in Star Wars Battlefront II, using the latest MangoHud version from the AUR (not mangohud-git). For approximately the first 15 minutes I experienced no spikes at all and the framerate was constant and smooth. After that I encountered some small ones, roughly every 5 minutes I guess, which lasted 2 seconds each. Then the spikes seemed to be gone, until my game froze for 5 seconds, just like when I had autogroup enabled. After the freeze, audio and video were out of sync for a second and then everything went back to normal. So it's more related to heavy workload, as the title of this issue suggests. My CPU usage was about 54 to 59% during gameplay and the GPU was at 99%, which is expected because the graphics card is rendering the shaders and everything else. I was using RDB-r2, I guess, because it's included in the linux-tkg kernel provided by @TkGlitch. I put the links below:

Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"):
https://github.com/Frogging-Family/linux-tkg/blob/master/customization.cfg
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/CacULE/RDB/rdb.patch#L56

So that is the CacULE RDB version I'm using. Should I test RDB-r3, or is RDB-r2 fine? I'm not sure exactly which version that commit belongs to, but I can test by compiling it manually. I don't know if the AUR package linux-cacule-rdb has these problems too, but I'll first try the one included in the linux-tkg kernel. I prefer it because it has more patches that can increase performance and improve CPU efficiency. However, I'm starting to think one of those patches could be causing the problem too.

On the other hand, the theories you mention are possible. I haven't tested without the compositor, and I don't know how to deactivate it; I'll look into that and test without it too. Currently I'm using OpenGL 2.0; OpenGL 3.1 is also available. I've read that many people suggest Compton as a replacement, so I could test it too. That's my progress so far. I'll keep testing and report whether CONFIG_HZ_PERIODIC=y and the parameters kernel.sched_cache_factor = 0 and kernel.sched_starve_factor = 0 make any difference. Thanks for the reply!

@ltsdw
Author

ltsdw commented Aug 21, 2021

hi @hamadmarri

Just recompiled here, with CONFIG_HZ_PERIODIC=y and tested with:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

but there was no difference; I'm still experiencing the hangs.

The compositor was also mentioned here. I don't know if disabling the compositor worked for you, @JohnyPeaN, but I tried disabling it here and it didn't make any difference (I'm using picom, though, not Plasma).

So far what worked was disabling RDB altogether or using noautogroup.

@JohnyPeaN

@hamadmarri I'm not sure which process is responsible for compositing, but I think it's kwin_x11. Anyway, it has normal priority (0), like the rest of the desktop. I will try changing its priority to see if it has an effect.

Earlier RDB versions had these problems for me too. Maybe they changed a little. Earlier RDB couldn't utilize all cores during compilation with #threads=#cores; this seems to be better now. As for these in-game lags, it was similar.

I'm also testing whether foreground processes are affected by heavy background processes, like the mentioned compilation with nice -19. This doesn't work well for me on anything except BMQ (but BMQ changes process priorities on the fly, which is maybe a little bit of cheating).

Also, I'm not recompiling to test autogroup on/off. Just to confirm, does
echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
switch it off?

@ltsdw
Author

ltsdw commented Aug 22, 2021

> Also, I'm not recompiling to test autogroup on/off. Just to confirm does
> echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
> switch it off?

@JohnyPeaN
Yeah, I think so. You can also use the kernel command-line parameter noautogroup.
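
To confirm which of the two took effect, the runtime knob and the boot command line can both be read back. A small sketch, with an illustrative function name and optional path arguments that exist only to keep it testable:

```shell
#!/bin/sh
# Report the autogroup state.
# $1: sysctl file (defaults to /proc/sys/kernel/sched_autogroup_enabled)
# $2: kernel command line file (defaults to /proc/cmdline)
autogroup_status() {
    knob=${1:-/proc/sys/kernel/sched_autogroup_enabled}
    cmdline=${2:-/proc/cmdline}
    if [ ! -r "$knob" ]; then
        echo "knob missing (kernel probably built without CONFIG_SCHED_AUTOGROUP)"
    elif [ "$(cat "$knob")" = "1" ]; then
        echo "enabled"
    elif grep -q 'noautogroup' "$cmdline" 2>/dev/null; then
        echo "disabled at boot (noautogroup)"
    else
        echo "disabled via sysctl"
    fi
}
```

Running autogroup_status before and after the echo 0 | sudo tee command should show the state change from enabled to disabled via sysctl.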

@MoisesMH

MoisesMH commented Aug 22, 2021

hey @hamadmarri
I've tried compiling the kernel as you suggested (with CONFIG_HZ_PERIODIC=y instead of CONFIG_NO_HZ_IDLE=y), but during the build some modules gave me errors, and because of that I was afraid it wasn't building properly. Besides, the compilation finished in less than 7 minutes, while the kernels I've compiled usually take 15 to 20 minutes. That makes it suspicious; maybe the error interrupted the whole process. Here is a fragment of the output where the errors appear with CONFIG_HZ_PERIODIC=y:

CC kernel/sched/clock.o
CC fs/crypto/keysetup_v1.o
CC fs/verity/signature.o
CC arch/x86/events/amd/uncore.o
CC fs/notify/notification.o
CC mm/maccess.o
AR fs/verity/built-in.a
CC mm/page-writeback.o
CC fs/crypto/policy.o
CC fs/notify/group.o
CC kernel/sched/cputime.o
CC kernel/sched/idle.o
CC arch/x86/events/amd/ibs.o
CC fs/crypto/bio.o
CC fs/notify/mark.o
CC arch/x86/events/amd/iommu.o
CC fs/crypto/inline_crypt.o
CC kernel/sched/fair.o
CC kernel/sched/rt.o
CC fs/notify/fdinfo.o
CC mm/readahead.o
CC [M] arch/x86/events/amd/power.o
AR fs/crypto/built-in.a
CC mm/swap.o
AR fs/notify/built-in.a
CC fs/nfs_common/nfs_ssc.o
kernel/sched/fair.c: In function ‘newidle_balance’:
kernel/sched/fair.c:11324:17: error: implicit declaration of function ‘nohz_newidle_balance’; did you mean ‘nohz_run_idle_balance’? [-Werror=implicit-function-declaration]
11324 | nohz_newidle_balance(this_rq);
| ^~~~~~~~~~~~~~~~~~~~
| nohz_run_idle_balance
CC [M] fs/nfs_common/nfsacl.o
AR arch/x86/events/amd/built-in.a
CC arch/x86/events/intel/core.o
CC arch/x86/events/intel/bts.o
CC arch/x86/events/zhaoxin/core.o
CC [M] fs/nfs_common/grace.o
LD [M] fs/nfs_common/nfs_acl.o
CC mm/truncate.o
AR fs/nfs_common/built-in.a
CC fs/iomap/trace.o
CC mm/vmscan.o
AR arch/x86/events/zhaoxin/built-in.a
CC mm/shmem.o
CC fs/iomap/apply.o
CC arch/x86/events/intel/ds.o
CC fs/iomap/buffered-io.o
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make: *** [Makefile:1862: kernel] Error 2
make: *** Waiting for unfinished jobs....
CC fs/iomap/direct-io.o
CC arch/x86/events/intel/knc.o
CC mm/util.o
CC mm/mmzone.o
CC arch/x86/events/intel/lbr.o

On the other hand, when I compile with just tickless idle (CONFIG_NO_HZ_IDLE=y), the kernel compiles without any errors. I've only applied the cacule, uksm, futex2, security, and more-uarches patches. It also gave me errors because I've also applied fsync, which is a previous version of the more advanced futex2 approach, but they're separate features. That explains why the PKGBUILD provided by TkGlitch gave me those errors too when CONFIG_HZ_PERIODIC=y is applied. I don't know why. I think you should inspect those lines. I don't think the other patches are causing the problem, since I haven't integrated any other scheduler, and CacULE replaces CFS. That's all the information I can provide. Greetings!

NOTE: maybe your scheduler is only designed to work with full tickless and tickless idle kernels? Maybe I'm confused, hehe.

@hamadmarri
Owner

hamadmarri commented Aug 22, 2021

Hi @MoisesMH

Could you please try this fix #47 (comment)

I will update the fix in the github soon.

Thanks

EDIT:

bb77376

@MoisesMH

MoisesMH commented Aug 24, 2021

hey @hamadmarri
I've compiled a kernel with your latest commit and applied some additional patches, but the kernel was not working properly: when gaming, the framerate wasn't stable and the CPU usage was too high (I guess that happened because of esync; futex2 was not working, even though I patched it in). So I proceeded to test the fix you suggested in your last message, applied to TkGlitch's linux-tkg kernel, which has an earlier version of your scheduler, I guess:

Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"):
https://github.com/Frogging-Family/linux-tkg/blob/master/customization.cfg
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/CacULE/RDB/rdb.patch#L56

I've got to say that on my system, even with CONFIG_HZ_PERIODIC=y, there are still lag spikes, but they're less frequent than with CONFIG_NO_HZ_IDLE=y. I also set the variables you suggested in my /etc/sysctl.conf:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

and then ran "sudo sysctl --system" to apply the changes from that file to the kernel, but the hangs are still present. Disabling autogroup (kernel.sched_autogroup_enabled=0) helped a little to reduce the frequency of the lag spikes and their duration (each hang now lasts up to 2 seconds). Before, with CONFIG_NO_HZ_IDLE=y, they lasted 5 seconds on average. In the menu everything is smooth, and even during gameplay, when the hangs are not present, the game runs butter-smooth. Another detail: while a hang is happening, the MangoHud overlay shows CPU usage rising by about 10 percentage points on average (from 55% to 65%; it even reached 74%). It's weird that it only happens under heavy workload. For other tasks, like running Audacious or Lutris, it's noticeably faster than without RDB; the speed at which different applications open surprises me. For those jobs it's butter-smooth; the hangs only happen during intensive gameplay. That's all I've got. I really have no idea why it only happens under intensive workload. Maybe the code is adapted to ordinary tasks but not to this kind of load. I'll stay tuned for more news. Thanks for the effort. Keep it up!

EDIT: could you apply the two last commits to the cacule-5.14.patch please? I want to apply it for testing with the futex2-dev kernel from Collabora. Thanks!

@hamadmarri
Owner

> EDIT: could you apply the two last commits to the cacule-5.14.patch please? I want to apply it for testing with the futex2-dev kernel from Collabora. Thanks!

Hi @MoisesMH
Updated 5.14
https://github.com/hamadmarri/cacule-cpu-scheduler/tree/master/patches/CacULE/v5.14

Thank you

@MoisesMH

MoisesMH commented Aug 24, 2021

Nice, thank you! Lately I've also tried the Liquorix kernel with the MuQSS scheduler (CONFIG_HZ_100=y is the default), Android modules, ntfs3 and uksm. What surprises me is the CPU usage. At the game menu of Star Wars Battlefront II, your scheduler with linux-tkg (CONFIG_HZ_1000=y is the default) uses 24% to 32-33% CPU, but with this new kernel it was reaching a whopping 54% to 59%. I don't believe it's uksm that is increasing CPU usage, because its main function is memory deduplication. Also, during gameplay, your scheduler was at around 54% to 62% CPU usage, while the lqx kernel with MuQSS reached 66% up to 79%. It's impressive how optimized the linux-tkg kernel is compared to Liquorix. I haven't tried linux-tkg with uksm yet; I'm going to compile it now and see how it does with CacULE, with and without RDB. Keep it up!

@hamadmarri
Owner

The issue could be related to kernel.sched_cacule_yield.
Can you please try with

kernel.sched_cacule_yield = 0

Thank you

@MoisesMH

MoisesMH commented Aug 25, 2021

Hey @hamadmarri
I've set kernel.sched_cacule_yield=0 in my sysctl.conf, but it didn't help. Instead, the system became unstable and I saw more lag spikes during co-op gameplay, though not at the game menu. So it performs noticeably better with kernel.sched_cacule_yield=1. Oh, I compiled with CONFIG_NO_HZ_IDLE=y and an RDB interval of 15. I was also testing with UKSM. I noticed that lag spikes are less frequent with this configuration. I don't know which part helped neutralize some of the hangs: the new RDB interval with CONFIG_NO_HZ_IDLE=y, or UKSM, which could be helping too. I wonder what the results would be with periodic ticks and kernel.sched_cacule_yield=0. Cheers!

@hamadmarri
Owner

hamadmarri commented Aug 25, 2021

Hi @MoisesMH , @raykzhao , @JohnyPeaN , @ltsdw , @ptr1337 , @SoongVilda

I am planning to rework RDB and start over from the beginning. First I need to review how the nohz idle wakeup mechanism works. I am also thinking of adding a feature where some CPUs are assigned to serve interactive tasks (such a CPU gives more priority to interactive tasks but can still run non-interactive tasks at the same time). This idea is based on this paper (https://www.researchgate.net/profile/Julien-Soula/publication/254213707_ARTiS_an_Asymmetric_Real-Time_Scheduler_for_Linux_on_Multi-Processor_Architectures/links/00b495350104a70d19000000/ARTiS-an-Asymmetric-Real-Time-Scheduler-for-Linux-on-Multi-Processor-Architectures.pdf)

The next RDB must take all the nohz work into account, and maybe use a global queue of candidate tasks, with one task from each CPU (the task that has the highest IS but is not running). Each CPU will have one slot in the global queue, and it must guarantee that the task advertised in its slot is ready to migrate at any time, unless the slot holds a null value.

The amount of locking could increase, but the queue is not very big; it only contains nproc items.
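
To make the global-queue idea above easier to discuss, here is a rough user-space sketch in Python. It is only an illustration of the description, not the planned kernel code: the names `Task`, `CandidateQueue`, `advertise` and `steal` are made up, a larger `score` is assumed to mean a higher IS, and locking is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    score: int  # hypothetical stand-in for IS; assumption: larger means higher IS


class CandidateQueue:
    """Global queue with exactly one slot per CPU; None means nothing is advertised."""

    def __init__(self, nproc: int) -> None:
        self.slots: List[Optional[Task]] = [None] * nproc

    def advertise(self, cpu: int, waiting: List[Task]) -> None:
        # Publish the best waiting (not running) task of this CPU, or clear the slot.
        self.slots[cpu] = max(waiting, key=lambda t: t.score, default=None)

    def steal(self, idle_cpu: int) -> Optional[Task]:
        # Scan at most nproc slots and take the best candidate from another CPU.
        best_cpu = None
        for cpu, task in enumerate(self.slots):
            if cpu == idle_cpu or task is None:
                continue
            if best_cpu is None or task.score > self.slots[best_cpu].score:
                best_cpu = cpu
        if best_cpu is None:
            return None
        # Empty the slot so the same task cannot be migrated twice.
        task, self.slots[best_cpu] = self.slots[best_cpu], None
        return task


queue = CandidateQueue(nproc=4)
queue.advertise(0, [Task("a", 3), Task("b", 7)])
queue.advertise(1, [])            # CPU 1 has nothing to offer, slot stays None
queue.advertise(2, [Task("c", 5)])
picked = queue.steal(idle_cpu=3)
print(picked)
```

Since the scan never touches more than nproc slots, a real implementation would presumably need some per-slot synchronization, which this sketch does not attempt to model.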

Thanks

@MoisesMH

MoisesMH commented Aug 26, 2021

Hey @hamadmarri
That article seems interesting. I'll read more of it later. On the other hand, I've made a discovery too haha. I had never thought of tweaking the values you provided when you introduced them in a discussion panel. I'm referring to cache_factor and starve_factor (at #43). First, I changed cache_factor to 0 as you suggested and starve_factor to 15944. When playing SWBF 2, all of the hangs were apparently gone on one map. Then the match changed to another map, which I guess was more resource hungry because of the more complex graphics (terrain, leaves, ambient occlusion, etc), and two or three hangs appeared. Then I changed cache_factor to 8192, since you suggested increasing it. The performance was the same until one big hang appeared (5 secs I guess). Then it went back to normal and the game was fluid.

What's important here is that playing with those settings affected the way RDB was performing. I'm not sure whether I have to lower starve_factor to avoid those spikes or not. You mentioned that raising it will make the system run smoother, but I suspect that raising it too much will starve more groups of applications. I'm not sure about that, but I'll try a better value to see if it's true.

Also, I have a question: after finding a starve_factor that fits me, why do you say we have to raise cache_factor as much as we can? On a less intensive map, with cache_factor=0 and starve_factor=15944, the game ran with no spikes at all and the framerate was stable. But on this new, more intensive map, I've seen just one or two spikes with cache_factor at 0 or 8192. Indeed, with 8192 I've seen one or two more spikes than with 0. How would you explain that, so I can understand the tweaking I've done? Greetings and take care!

EDIT: my current RDB interval is 15, and I'm running with CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ=y. Also, I've disabled the compositor with the Alt+Shift+F12 key combination.

EDIT 2: These are the combinations which gave me almost no spikes on Star Wars Battlefront II:

  1. One or two lag spikes (the framerate was constant, even on a graphically intensive map such as Ajan Kloss)
    sched_cache_factor = 7972
    sched_starve_factor = 19930

  2. No spikes at all (I don't remember if I tested the map Ajan Kloss with this configuration, but other maps were running flawlessly)
    sched_cache_factor = 3986
    sched_starve_factor = 17937

Increasing the cache factor only made things worse and didn't allow a decent gameplay experience; in my opinion, too high a value can be the cause of those lag spikes. My numbers were inspired by my total installed memory, as shown by the "free" command in a console. It returned a total of 15944 MB (16 GB). The numbers I came up with: 7972 1993 17937 19930 3986 5979 4983.
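
For anyone who wants to derive similar candidates from their own memory total, the listed numbers happen to match simple fractions of 15944. That reading is only an inference from the values; the comment does not say how they were computed. A small Python sketch under that assumption (the fraction list and the round-half-up helper `scaled` are illustrative):

```python
from fractions import Fraction

TOTAL_MB = 15944  # total memory reported by `free`, as quoted above

# Assumed fractions: inferred from the listed numbers, not stated in the comment.
FRACTIONS = [
    Fraction(1, 2), Fraction(1, 8), Fraction(9, 8), Fraction(5, 4),
    Fraction(1, 4), Fraction(3, 8), Fraction(5, 16),
]


def scaled(total: int, frac: Fraction) -> int:
    """Return total * frac as an integer, rounding halves up."""
    return (2 * total * frac.numerator + frac.denominator) // (2 * frac.denominator)


values = [scaled(TOTAL_MB, f) for f in FRACTIONS]
print(values)
```

Swapping in a different `TOTAL_MB` gives the equivalent candidates for another machine, though nothing in this thread shows that memory size should actually drive these tunables.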

@hamadmarri
Owner

Hi @MoisesMH

The cache factor does not seem to work well with the RDB design. I need to troubleshoot the cache and starve factors too.

Thank you

@MoisesMH

Yeah, it seems to be causing the issue. I've discovered another combination, which I think is close to the default:

kernel.sched_cache_factor = 10629
kernel.sched_starve_factor = 21258

With this configuration I've experienced almost no spikes: one spike happened at the beginning of gameplay, but I haven't noticed any freezes or big hangs. I think I'll stay with this configuration. The two values add up to less than the sched_interactivity_factor, and one is double the other (1/3 * 31888, 2/3 * 31888). Hope you're doing great with your investigation and development!
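
The arithmetic behind those two values can be checked with a few lines of Python. This is just a sketch: 31888 is the figure used in the comment, rounding down is an assumption that happens to reproduce both numbers, and `sysctl_lines` is a made-up helper that only formats text for a sysctl configuration file.

```python
BASE = 31888  # the figure the comment splits into thirds

cache_factor = BASE // 3        # one third, rounded down
starve_factor = 2 * BASE // 3   # two thirds, rounded down


def sysctl_lines(settings: dict) -> str:
    """Format key/value pairs as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


fragment = sysctl_lines({
    "kernel.sched_cache_factor": cache_factor,
    "kernel.sched_starve_factor": starve_factor,
})
print(fragment)
```

The printed lines match the configuration quoted above, and the two values sum to 31887, one below 31888.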
