Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HANG while running gcc with sample cbr #6853

Open
chenhy0106 opened this issue Jun 23, 2024 · 6 comments
Open

HANG while running gcc with sample cbr #6853

chenhy0106 opened this issue Jun 23, 2024 · 6 comments
Labels

Comments

@chenhy0106
Copy link
Contributor

chenhy0106 commented Jun 23, 2024

Describe the bug
When running gcc hello.c -o hello with the client cbr, dynamorio hangs. This was observed for both x86 and RISCV64.

However, when we do not use dr_flush_region() and dr_redirect_execution() in cbr.c, everything is OK! As we have inserted the jmp instruction after clean call, cbr still works correctly without dr_flush_region() and dr_redirect_execution() (with performance loss perhaps).

So, dr_redirect_execution() may involves some bugs. As we invoke prepare_for_call_ex() and other 'save' operations before clean call, I'm not sure it's right not to call cleanup_after_call_ex() to restore when using the dr_redirect_execution() routine.

X86 output log is:

$ ./bin64/drrun -c api/bin/libcbr.so -- gcc ../scripts/test_clients/hello.c -o ../scripts/test_clients/hello_x86  
<Starting application /usr/bin/x86_64-linux-gnu-gcc-9 (28411)>
<Initial options = -no_dynamic_options -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x0000000072002480
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007f15a5058000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x0000000073004cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007f13a4afc630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007f15a4fda100
>
<spurious rep/repne prefix @0x00007f15a4ebfdf0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007f15a4fc3000-0x00007f15a4fd9000 prot=0x00000001
        cache says: 0x00007f15a4fc3000-0x00007f15a4fda000 prot=0x00000001
>
<CURIOSITY : (0) && "failed to suspend thread in 5s" in file /mnt/e/GitHub/dynamorio/core/unix/os.c line 3856
version 10.92.19891, custom build
-no_dynamic_options -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_li
0x00007f13a50910c0 0x00007f15a52e4095
0x00007f13a5091100 0x00007f15a51ee4a1
0x00007f13a5091a90 0x00007f15a51efee3
0x00007f13a5091b60 0x00007f15a508f04b
0x00007f13a5091c20 0x00007f15a5091217
0x00007f13a5091c70 0x00007f15a5092376
0x00007f13a5091cf0 0x00007f15a5092b45
0x00007f13a5091d50 0x00007f15a520f15a
0x00007f13a5091d90 0x00007f15a520f18f
0x00007f13a5091db0 0x0000000072002738
0x00007f13a50926f0 0x000000004491aed9
/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so=0x00007f15a500a000
/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so=0x0000000072000000
/lib/x86_64-linux-gnu/libc.so.6=0x00007f13a4ada000
/usr/lib64/ld-linux-x86-64.so.2=0x00007f15a4fd9000
/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so=0x0000000073000000>
...... (repeating)

Expected behavior
DO NOT use dr_redirect_execution() in cbr, we can run the program successfully:

$ ./bin64/drrun -c api/bin/libcbr.so -- gcc ../scripts/test_clients/hello.c -o ../scripts/test_clients/hello_x86
<Starting application /usr/bin/x86_64-linux-gnu-gcc-9 (23304)>
<Initial options = -no_dynamic_options -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x00000000720013c0
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007f76ec622000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x0000000073004cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007f74ec0c6630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007f76ec5a4100
>
<spurious rep/repne prefix @0x00007f76ec489df0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007f76ec58d000-0x00007f76ec5a3000 prot=0x00000001
        cache says: 0x00007f76ec58d000-0x00007f76ec5a4000 prot=0x00000001
>
<-- execve /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -->
<Starting application /usr/lib/gcc/x86_64-linux-gnu/9/cc1 (23305)>
<Initial options = -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x00000000720013c0
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007fecc8588000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x0000000073004cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007fecc8179630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007fecc850a100
>
<spurious rep/repne prefix @0x00007fecc6c8ddf0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007fecc84f3000-0x00007fecc8509000 prot=0x00000001
        cache says: 0x00007fecc84f3000-0x00007fecc850a000 prot=0x00000001
>
<(1+x) Handling our fault in a TRY at 0x00007fecc87f4d4e>
<Stopping application /usr/lib/gcc/x86_64-linux-gnu/9/cc1 (23305)>
<-- execve /home/chy/.local/bin/as -->
<-- execve /usr/local/sbin/as -->
<-- execve /usr/local/bin/as -->
<-- execve /usr/sbin/as -->
<-- execve /usr/bin/as -->
<Starting application /usr/bin/x86_64-linux-gnu-as (23306)>
<Initial options = -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x00007f202076c3c0
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007f2064912000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x00007f20207b2cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007f2064427630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007f2064894100
>
<spurious rep/repne prefix @0x00007f20647e9df0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007f206487d000-0x00007f2064893000 prot=0x00000001
        cache says: 0x00007f206487d000-0x00007f2064894000 prot=0x00000001
>
<(1+x) Handling our fault in a TRY at 0x00007f2064b7ed4e>
<Stopping application /usr/bin/x86_64-linux-gnu-as (23306)>
<-- execve /usr/lib/gcc/x86_64-linux-gnu/9/collect2 -->
<Starting application /usr/lib/gcc/x86_64-linux-gnu/9/collect2 (23307)>
<Initial options = -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x00000000720013c0
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007f628d279000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x0000000073004cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007f608cd9f630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007f628d1fb100
>
<spurious rep/repne prefix @0x00007f628d162df0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007f628d1e4000-0x00007f628d1fa000 prot=0x00000001
        cache says: 0x00007f628d1e4000-0x00007f628d1fb000 prot=0x00000001
>
<-- execve /usr/bin/ld -->
<Starting application /usr/bin/x86_64-linux-gnu-ld.bfd (23308)>
<Initial options = -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so' 0x00007f02c57e23c0
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so' 0x00007f0309b7d000
add-symbol-file '/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so' 0x00007f02c5828cb0
add-symbol-file '/lib/x86_64-linux-gnu/libc.so.6' 0x00007f0309594630
add-symbol-file '/usr/lib64/ld-linux-x86-64.so.2' 0x00007f0309aff100
>
<spurious rep/repne prefix @0x00007f0309956df0 (f3 0f 1e fa): >
<curiosity: rex.w on OPSZ_6_irex10_short4!>
<get_memory_info mismatch! (can happen if os combines entries in /proc/pid/maps)
        os says: 0x00007f0309ae8000-0x00007f0309afe000 prot=0x00000001
        cache says: 0x00007f0309ae8000-0x00007f0309aff000 prot=0x00000001
>
<(1+x) Handling our fault in a TRY at 0x00007f0309de9d4e>
<Stopping application /usr/bin/x86_64-linux-gnu-ld.bfd (23308)>
<Stopping application /usr/lib/gcc/x86_64-linux-gnu/9/collect2 (23307)>
<Stopping application /usr/bin/x86_64-linux-gnu-gcc-9 (23304)>

Other information

  • Without any clients, everything is OK.
  • Under my test, some simple applications (such as hello_world.c, ls and so on) can run correctly.
  • cbr sample has not been ported to AARCH64 formally, but I have tried it and found that it still faces the same problem of dr_redirect_execution().
@derekbruening derekbruening changed the title HANG while running with sample cbr HANG while running gcc with sample cbr Jun 24, 2024
@derekbruening
Copy link
Contributor

This looks like a failure to suspend all the threads (a "synchall") when running gcc. That has other ramifications as synchall is used for other purposes beyond flushing the cache for this cbr sample: it is needed for attach and detach and certain other central operations.

The question is, what is gcc doing that is unusual. Is it using vfork or unusual clone variations to create weird threads? There's a chain of sub-processes here, but the first execve output is not shown at the hang point: but did it already do a vfork? One theory to investigate is whether an unusual thread is the cause of the problem as DR tries to suspend it yet it has its own signal handler table or sthg.

The debug logs should give substantial help here. -loglevel 2 might be enough; might need 3. Look at the sync and suspend log messages and see if there are errors when it sends a signal to the other threads. How many other threads are there? How were they created: what flags to clone? Is vfork involved?

@chenhy0106
Copy link
Contributor Author

Firstly, the problem is indeed about dr_flush_region() but not about dr_redirect_execution(). When just removing dr_flush_region(), everything is OK for all architectures.

Then, I dump the debug logs and find that gcc uses vfork . However, the first vfork fails (clone flag is 0x4111, CLONE_VFORK | CLONE_VM). More specifically, a new thread has been successfully created but the parent thread does not return from system call.

The last few lines of parent's thread log:

...
Entry into do_syscall to execute a non-ignorable system call
system call 58
syscall: vfork
add_process_lock: 0 lock 0x00007f35dfbbd7b0: name=child_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/signal.c:742
rank=87 owner=30458 owning_dc=0x00007f35dfb92980 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 child_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/signal.c:742
create_clone_record: dstack for new thread is 0x00007f35dfcdf000
allocated clone record: 0x00007f35dfcde880
create_clone_record: thread 30458, pc 0x00007f35df6150dc
TLS copy is 0x00007f35dfb531f0

After that, no more output log. By removing dr_flush_region() in cbr.c, we can get the subsequent output log, which can be referenced:

post syscall: sysnum=0x000000000000003a, result=0x00000000000017c9 (6089)
syscall: vfork returned 0x00000000000017c9
vfork: restoring xsp from 0x00007fd3a22c9880 to 0x00007ffda9b93df0
finished handling system call

The new thread's log:

DynamoRIO version 10.92.19896, custom build, (c) DynamoRIO developers
DynamoRIO built with: -DINTERNAL -DDEBUG -DDRSTATS_DEMO -DX86 -DX64 -DLINUX -DUNIX -DDR_HOST_X86 -DDR_HOST_X64 -DAPP_EXPORTS -DKSTATS -DBUILD_TESTS -DHAVE_FVISIBILITY -DHAVE_FNOSANITIZE_NULL -DHAVE_TYPELIMITS_CONTROL -DANNOTATIONS -DHAVE_RSEQ -DDR_DO_NOT_DEFINE_uint -DDR_DO_NOT_DEFINE_ushort -DDR__Bool_EXISTS -DASSEMBLE_WITH_GAS -DHAVE_MEMINFO -DHAVE_MEMINFO_MAPS -DHAVE_TLS -DHAVE_SIGALTSTACK -DUSE_VISIBILITY_ATTRIBUTES -DDR_APP_EXPORTS -DDEBUG_MEMORY -DHEAP_ACCOUNTING -DDEADLOCK_AVOIDANCE -DSHARING_STUDY -DHASHTABLE_STATISTICS
DynamoRIO built on: Jun 25 2024 16:08:47
THREAD 30492 (dcontext 0x00007f35dfc05a00)

DR stack is 0x00007f35dfcd1000-0x00007f35dfcdf000 (passed in 0x00007f35dfcdf000)
thread_locks=0x00007f35dfb54690 size=8
thread_stats=0x00007f35dfcf5050 size=6952
thread_kstats=0x00007f35dfcf6b80 size=3192
threads_started
signal stack is 0x00007f35dfd0c000 - 0x00007f35dfd14000
TLS app lib base  =0x00007f37dfc0c600
TLS app alt base  =0x0000000000000000
TLS priv lib base =0x00007f35dfce5700
TLS priv alt base =0x00007f35dfce1000
TLS DynamoRIO base=0x00007f35dfce1000
post-TLS-setup, cur gs base is 0x00007f35dfce1000
post-TLS-setup, cur fs base is 0x00007f35dfce5700
hashtable_generic_init trace heads htable=0x00007f35dfce8af0 bits=9 size=513 mask=0x00000000000001ff offset=0 load=75% resize=384
               trace heads table 0x00007f35dfce8cf8  0x0000000000000000  groom=0% groom_at=0
hashtable_fragment_init bblock htable=0x00007f35dfc07f70 bits=5 size=33 mask=0x000000000000001f offset=0 load=60% resize=19
               bblock table 0x00007f35dfcf0048  0x0000000000000000  groom=0% groom_at=0
hashtable_fragment_init future htable=0x00007f35dfc081b0 bits=5 size=33 mask=0x000000000000001f offset=0 load=65% resize=21
               future table 0x00007f35dfcf0248  0x0000000000000000  groom=0% groom_at=0
hashtable_ibl_init ret_trace htable=0x00007f35dfc07820 bits=7 size=129 mask=0x000000000000007f offset=0 load=50% resize=64
               ret_trace table 0x00007f35dfcf0480  0x0000000000000000  groom=0% groom_at=0
hashtable_ibl_init indcall_trace htable=0x00007f35dfc07958 bits=7 size=129 mask=0x00000000000007f0 offset=4 load=50% resize=64
               indcall_trace table 0x00007f35dfcf0cc0  0x0000000000000000  groom=0% groom_at=0
hashtable_ibl_init indjmp_trace htable=0x00007f35dfc07a90 bits=7 size=129 mask=0x000000000000007f offset=0 load=50% resize=64
               indjmp_trace table 0x00007f35dfcf1500  0x0000000000000000  groom=0% groom_at=0
continuation pc is 0x00007f35df6150dc
parent tid is 30458, parent sysnum is 58(vfork), clone flags=0x4111
inheriting signal handlers from parent
	handler for signal 1 is 0x000000000040a830
	handler for signal 2 is 0x000000000040a830
	handler for signal 13 is 0x000000000040a830
	handler for signal 15 is 0x000000000040a830
	handler for signal 17 is 0x0000000000000000
	handler for signal 34 is 0x0000000000000001
thread has private itimers (for now)
(Begin) Thread statistics @26260 global, 0 thread fragments (8:21.793)add_process_lock: 0 lock 0x00007f35dfcf5058: name=thread_stats_lock(mutex)@/mnt/e/GitHub/dynamorio/core/utils.c:3043
rank=98 owner=30492 owning_dc=0x00007f35dfc05a00 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 thread_stats_lock(mutex)@/mnt/e/GitHub/dynamorio/core/utils.c:3043
:
                                Special heap units (thread):                 1
                           Peak special heap units (thread):                 1
                  Special heap align space (bytes) (thread):                 8
             Current special heap capacity (bytes) (thread):              4096
                Peak special heap capacity (bytes) (thread):              4096
                   Current heap units on live list (thread):                 1
                      Peak heap units on live list (thread):                 1
                              Heap headers (bytes) (thread):                88
                          Heap align space (bytes) (thread):                12
                     Peak heap align space (bytes) (thread):                12
                     Heap bucket pad space (bytes) (thread):              1856
                Peak heap bucket pad space (bytes) (thread):              1856
                            Heap allocs in buckets (thread):                25
                        Heap allocs variable-sized (thread):                11
                             Total reserved memory (thread):            282624
                        Peak total reserved memory (thread):            282624
               Guard pages, reserved virtual pages (thread):                 6
          Peak guard pages, reserved virtual pages (thread):                 6
                    Current stack capacity (bytes) (thread):             32768
                       Peak stack capacity (bytes) (thread):             32768
                              Heap claimed (bytes) (thread):             90848
                         Peak heap claimed (bytes) (thread):             90848
                     Current heap capacity (bytes) (thread):            110592
                        Peak heap capacity (bytes) (thread):            110592
           Heap reserved but not committed (bytes) (thread):            114688
      Peak heap reserved but not committed (bytes) (thread):            176128
              Current total memory from OS (bytes) (thread):            282624
                 Peak total memory from OS (bytes) (thread):            282624
           Current vmm blocks for unreachable heap (thread):                50
              Peak vmm blocks for unreachable heap (thread):                50
                      Current vmm blocks for stack (thread):                10
                         Peak vmm blocks for stack (thread):                10
   Current vmm blocks for unreachable special heap (thread):                 9
      Peak vmm blocks for unreachable special heap (thread):                 9
                  Our virtual memory blocks in use (thread):                69
             Peak our virtual memory blocks in use (thread):                69
             Allocations using multiple vmm blocks (thread):                 3
                Blocks used for multi-block allocs (thread):                69
         Current vmm virtual memory in use (bytes) (thread):            282624
            Peak vmm virtual memory in use (bytes) (thread):            282624
(End) Thread statistics
Thread 30492 KSTATS {
             logging:  17492908 totc,      97 num,    154380 minc,    180339 avg,    320868 maxc,  17492908 self,         0 sub,
                            7 ms,         0 ms out,in LOG
} KSTATS
thread 30492 under DR control
add_process_lock: 0 lock 0x00007f35dfce84f8: name=sigmask_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/signal.c:738
rank=66 owner=30492 owning_dc=0x00007f35dfc05a00 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 sigmask_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/signal.c:738
thread 30492's initial app signal mask:
starting DR itimers from thread 30492

(Other fragment logs ... )

dr_flush_region_ex: 0x00007f35df6150e1-0x00007f35df6150e2
FLUSH STAGE 1: synch_unlink_priv(thread 30492 flushtime 2): 0x00007f35df6150e1-0x00007f35df6150e2
synch with all threads my id = 30492 Giving 4 permission and seeking 4 state
Looping over all threads (2 threads)
add_process_lock: 0 lock 0x00007f35dfb52818: name=synch_lock(mutex)@/mnt/e/GitHub/dynamorio/core/synch.c:144
rank=10 owner=30492 owning_dc=0x00007f35dfc05a00 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 synch_lock(mutex)@/mnt/e/GitHub/dynamorio/core/synch.c:144
About to try synch with thread #0/2 30458
Synching with thread 30458, giving 0, requesting 4, blocking=0
Looping on synch with thread 30458
add_process_lock: 0 lock 0x00007f35dfbbd058: name=suspend_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/os.c:2582
rank=64 owner=30492 owning_dc=0x00007f35dfc05a00 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 suspend_lock(mutex)@/mnt/e/GitHub/dynamorio/core/unix/os.c:2582
add_process_lock: 0 lock 0x00007f37e00b4a40: name=report_buf_lock(mutex)@/mnt/e/GitHub/dynamorio/core/utils.c:2015
rank=90 owner=30492 owning_dc=0x00007f35dfc05a00 contended_event=0xffffffff prev=0x0000000000000000
lock count_times_acquired=       1                              0                               0                              0                               0+2 report_buf_lock(mutex)@/mnt/e/GitHub/dynamorio/core/utils.c:2015
SYSLOG_WARNING: CURIOSITY : (0) && "failed to suspend thread in 5s" in file /mnt/e/GitHub/dynamorio/core/unix/os.c line 3856
version 10.92.19896, custom build
-no_dynamic_options -loglevel 3 -client_lib '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -client_lib64 '/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so;0;' -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exe
0x00007f35dfcdd0c0 0x00007f37dff2e11e
0x00007f35dfcdd100 0x00007f37dfe384a1
0x00007f35dfcdda90 0x00007f37dfe39ee3
0x00007f35dfcddb60 0x00007f37dfcd904b
0x00007f35dfcddc20 0x00007f37dfcdb217
0x00007f35dfcddc70 0x00007f37dfcdc376
0x00007f35dfcddcf0 0x00007f37dfcdcb45
0x00007f35dfcddd50 0x00007f37dfe5915a
0x00007f35dfcddd90 0x00007f37dfe5918f
0x00007f35dfcdddb0 0x0000000072002738
0x00007f35dfcde6f0 0x0000000045d2fbc1
/mnt/e/GitHub/dynamorio/build_x86/lib64/debug/libdynamorio.so=0x00007f37dfc54000
/mnt/e/GitHub/dynamorio/build_x86/api/bin/libcbr.so=0x0000000072000000
/lib/x86_64-linux-gnu/libc.so.6=0x00007f35df724000
/usr/lib64/ld-linux-x86-64.so.2=0x00007f37dfc23000
/mnt/e/GitHub/dynamorio/build_x86/ext/lib64/debug/libdrmgr.so=0x0000000073000000
DynamoRIO call stack:
Thread 30492 call stack:
	frame ptr 0x00007f35dfcdcf50 => parent 0x00007f35dfcdd0c0, 0x00007f37dfd434fe  
	frame ptr 0x00007f35dfcdd0c0 => parent 0x00007f35dfcdd100, 0x00007f37dff2e11e  
	frame ptr 0x00007f35dfcdd100 => parent 0x00007f35dfcdda90, 0x00007f37dfe384a1  
	frame ptr 0x00007f35dfcdda90 => parent 0x00007f35dfcddb60, 0x00007f37dfe39ee3  
	frame ptr 0x00007f35dfcddb60 => parent 0x00007f35dfcddc20, 0x00007f37dfcd904b  
	frame ptr 0x00007f35dfcddc20 => parent 0x00007f35dfcddc70, 0x00007f37dfcdb217  
	frame ptr 0x00007f35dfcddc70 => parent 0x00007f35dfcddcf0, 0x00007f37dfcdc376  
	frame ptr 0x00007f35dfcddcf0 => parent 0x00007f35dfcddd50, 0x00007f37dfcdcb45  
	frame ptr 0x00007f35dfcddd50 => parent 0x00007f35dfcddd90, 0x00007f37dfe5915a  
	frame ptr 0x00007f35dfcddd90 => parent 0x00007f35dfcdddb0, 0x00007f37dfe5918f  
	frame ptr 0x00007f35dfcdddb0 => parent 0x00007f35dfcde6f0, 0x0000000072002738  
	frame ptr 0x00007f35dfcde6f0 => parent 0x0000000000000001, 0x0000000045d2fbc1  

@derekbruening
Copy link
Contributor

Who is the target thread of the suspend? Is it the vfork child which is no longer there? CLONE_VFORK suspends the parent until the child execs at which point the child should not be on the list of threads in the parent process. Is it incorrectly still on the list and that's why the parent thinks it has to suspend it? Though you'd think the debug build synchall at exit time would have hit such a relatively simple bug in our existing vfork tests (IIRC there are some).

@chenhy0106
Copy link
Contributor Author

chenhy0106 commented Jun 26, 2024

The child thread calls dr_flush_region(), which in turn calls os_thread_suspend() to send a suspend signal to the parent thread. The child thread waits for the parent's suspend flag to be set as 1. However, the parent thread can't handle the signal because it's already in a suspended state (due to vfork). This results in a deadlock.

I suspect the issue is that we don't set the parent's suspend state when processing vfork.

I switched from using gcc to a simple program that only includes vfork and execl, but the same hang problem occurred. So, this issue seems to be independent of gcc.

@chenhy0106
Copy link
Contributor Author

My test program vfork.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = vfork();
    if (pid == 0) {
        printf("Child PID = %d\n", getpid());

        execl("/bin/echo", "echo", "From the child process!", NULL);
    } else if (pid > 0) {
        printf("Parent PID = %d, waiting for child...\n", getpid());

        int status;
        if (waitpid(pid, &status, 0) < 0) {
            perror("waitpid failed");
            exit(EXIT_FAILURE);
        }
    }

    return 0;
}

Command:

$ gcc vfork.c -o vfork_test
$ ./bin64/drrun -c api/bin/libcbr.so -- vfork_test

@derekbruening
Copy link
Contributor

We have suite/tests/linux/vfork.c and suite/tests/linux/vfork-fib.c: I guess none are ever run with a client or anything that does a synchall in the middle; the exit-time debug-build synchall works b/c the child and parent are separated at that point. So we can make a regression test using the existing vfork test app with the cbr sample.

For how to handle: to DR the vfork child is a different-thread-group different-sighand-table thread. DR's signal and itimer and other handlers should already handle the different sighand table and different thread group. It is just this suspension that is missing. If the kernel doesn't resume until execve or exit in the child, I guess DR needs new state in thread_record_t indicating that a thread is vfork-suspended (we know it's at a good point at do-syscall gencode as vfork should never be inlined into a fragment) and we need the suspend-thread code to check for that and we need the execve and exit syscall handling code to clear that state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants