Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

elfloader: improve stability #191

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

andybui01
Copy link

@andybui01 andybui01 commented Feb 28, 2024

As mentioned in #190, this is the series of commits that do general touch-ups in the elfloader.

This should reduce the Jetson Orin PR to 1 commit, and the aarch64 booting sequence PR (these commits rely on each other) to 6 commits, which should be more manageable, although I'll prepare those PRs are this passes (hopefully).

@@ -134,6 +144,13 @@ void smp_boot(void)
arm_disable_dcaches();
#endif
init_cpus();
#if defined(CONFIG_ARCH_AARCH64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could avoid the code duplication:

#if defined(CONFIG_ARCH_AARCH64)
    dsb();
#endif

    non_boot_lock = 1;

#if defined(CONFIG_ARCH_AARCH64)
    /* Secondary CPUs may still run with MMU & caches off. Force the update to be visible. */
    asm volatile("dc civac, %0\n\t" :: "r"(&non_boot_lock) : "memory");;
#endif

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why you do the dsb() before updating non_boot_lock actually. This should get a comment that explains this. Seems we want to ensure all changes done far get visible, before we release the other cores. We should do the dsb() unconditionally then, not just on aarch64.

    /* ensure all operation complete before we release the other cores */
    dsb(); 
    /* release secondary cores */
    non_boot_lock = 1;
    /* Secondary cores may still run with MMU & caches off. Force the update to be visible. */ 
#if defined(CONFIG_ARCH_AARCH64) 
    asm volatile("dc civac, %0\n\t" :: "r"(&non_boot_lock) : "memory"); 
#else
    // ToDo: add flush operation here to be safe.
#endif

@@ -31,6 +31,9 @@ SECTIONS
*(.data)
*(.data1)
*(.data.*)
__start__driver_list = .;
*(_driver_list)
Copy link
Member

@axel-h axel-h Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the commit comment, isn't the driver list even "ro-data" actually and should be put there?

The commit comment also contains the question "The driver list entry is present in the aarch32 linker script for
EFI, why was it missing for 64 bit?
" It should be removed provide the proper answer instead, that this was missing and is added now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is within the riscv linker script:

    . = ALIGN(16);
    .rodata :
    {
        *(.srodata*)
        . = ALIGN(16);
        *(.rodata)
        *(.rodata.*)
        /*
         * ld crashes when we add this here: *(_driver_list)
         */
        . = ALIGN(16);
        _archive_start = .;
        *(._archive_cpio)
        _archive_end = .;
    }

Moving the _driver_list into rodata may be breaking.

Copy link
Member

@axel-h axel-h Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the commit history give something why 'ld' is crashing here? Maybe this is fixed by now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit history didn't give any hints unfortunately. I don't have the risc-v toolchain currently so I'll try and build when I get the chance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also have a look on RISC-V.

#define GET_PGD_INDEX(x) (((x) >> (ARM_2MB_BLOCK_BITS + PMD_BITS + PUD_BITS)) & MASK(PGD_BITS))
#define GET_PUD_INDEX(x) (((x) >> (ARM_2MB_BLOCK_BITS + PMD_BITS)) & MASK(PUD_BITS))
#define GET_PMD_INDEX(x) (((x) >> (ARM_2MB_BLOCK_BITS)) & MASK(PMD_BITS))
#define GET_PGD_INDEX(x) (((word_t)(x) >> (ARM_2MB_BLOCK_BITS + PMD_BITS + PUD_BITS)) & MASK(PGD_BITS))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit comment it says that this works for pointer now also, but pointer must be cast to uintptr_t first to make then an integer. Maybe add this here also then?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the reason behind the comment is that in tools/seL4/elfloader-tool/include/types.h "word_t is practically an alias for size_t/uintptr_t on the platforms we support so far". Do you mean to just change word_t to uintptr_t here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think strict C rules say, that pointers must be cast to uintptr_t to make them an integer, from there you can cast the integer to any other integer type.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is reasonable, I did find this excerpt in the C99 standard though, and it's the only part I could find on uintptr_t:

The following type designates an unsigned integer type with the property that any valid
pointer to void can be converted to this type, then converted back to pointer to void,
and the result will compare equal to the original pointer: uintptr_t

So the strictest form would be a cast to void * first before uintptr_t. However, I think all the reasonable implementations have sensible ways for any pointer casted to uintptr_t to behave such that we can skip the void * cast.

@@ -34,7 +34,11 @@ void non_boot_main(void)
#endif
/* Spin until the first CPU has finished initialisation. */
while (!non_boot_lock) {
#ifndef CONFIG_ARCH_AARCH64
#ifdef CONFIG_ARCH_AARCH64
/* The compiler may optimize this loop away, add a dsb()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment seems incorrect, the variable is declared volatile, so the compiler is not allowed to optimize this out. However, the barrier might sill be needed to force a OO-pipeline not to try smart things. Instead of the barrier, we might also consider using https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html to let the compiler pick the appropriate meachnism.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're correct it's declared volatile so this is not needed. I can add a asm volatile("": : :"memory") to the loop body to let the compiler handle this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the comment. The compiler can't optimize the loop away, because the variable is volatile. However, we may need the explicit CPU barrier instruction between the read instruction stream created by the loop to ensure the CPU does not try to optimize out multiple reads to the same location.

I wonder, have you seen that without the barrier things really fail or just take much longer until some synchronization kick in, as the change is not visible immediately? It might be worth adding the specific details in the commit comment then. Especially, if the cache setting between reader and writer could still be different here, then we may need explicit cache flush/invalidation here also to guarantee visibility.

I wonder, is we should even remove the #ifdef CONFIG_ARCH_AARCH64 here, as the same could also happen on ARMv7, so we dsb could be helpful here also. It does not harm at at least. Also, this code is not time critical, it's the general core synchronization that happen on boot. So we can be extra cautions here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cpu_idle() is just dsb() and wfi() paired, so we can make this common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. cpu_idle() is doing something completely different, so this should not be used here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code as is

    while (!non_boot_lock) {
#ifdef CONFIG_ARCH_AARCH64
        dsb();
#else
        cpu_idle();
#endif
    }

And since cpu_idle() is itself: dsb() + wfi(), the dsb() would already be common.

Copy link
Member

@axel-h axel-h Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I see. But It's a bit odd that ARMv7 is doing a wfi() here also - I can't see who triggers the event on the other core that will bring us out of the wfi then. Did this ever work or was this stuck forever then - or did something else trigger an event by chance? I can't find anything in the commit history that explains this. Except that commit #4879dfe9 disbled this for CONFIG_ARCH_AARCH64 anyway as it might not have worked?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be simplified to just a barrier and we can find a way to do a cache invalidation (not just aarch64) on non_boot_lock when the boot CPU finishes initializing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since on aarch64 the D-Cache is disabled before the loop anyway, just a dsb() should be sufficient here. And it seems this worked well for your colleagues when the made the code change.
Since I'm not really sure how this is supposed to work on ARMv7/AARCH32 and we can't test this now either, it might be best to just leave it and add a comment that calling cpu_idle() this looks like a hack that should either be explained or reworked.

@axel-h
Copy link
Member

axel-h commented Mar 6, 2024

The simulation on ARMv8 fails now, could you have a look at this? Maybe you should split this PR even further up, so we can merge the trivial things faster.

@andybui01
Copy link
Author

The simulation on ARMv8 fails now, could you have a look at this?

The failure is due to the following reason: as it stands, the elfloader creates the boot page table by mapping 2 regions:

  1. The kernel region (kernel_info->virt_region_start) as MT_NORMAL memory.
  2. A bulk 512 GiB 1:1 mapping (used by the elfloader) that is strongly-ordered.

The final commit (3166422) erroneously assumes that this is region is for device memory, that is not correct. Instead, this commit will be part of the page table PR to follow, which splits the boot page table into distinct regions (elfloader - normal, kernel - normal, uart mmio - strongly ordered).

Andy Bui and others added 12 commits March 21, 2024 17:15
EFI may boot the elfloader with caches disabled on the secondary cores,
we want the value of non_boot_lock to be visible.

Some barriers are added to stabilize SMP booting in the elfloader.

Co-authored-by: Yanyan Shen <[email protected]>
Co-authored-by: Matthias Rosenfelder <[email protected]>
Signed-off-by: Andy Bui <[email protected]>
bootloader parameters.

Signed-off-by: Tw <[email protected]>
(type mismatch).

This chops off the aff3 level on Aarch64.

Why does the compiler not warn??? Because the own header was not
included. If you just include the header (without the change of the
return value in the header), we get:

seL4_tools/elfloader-tool/src/arch-arm/64/cpuid.c:14:10: error:
conflicting types for 'read_cpuid_mpidr'

   14 | uint32_t read_cpuid_mpidr(void)
      |          ^~~~~~~~~~~~~~~~

In file included from /home/mro/nvos_neu2/tools/seL4_tools/
elfloader-tool/src/arch-arm/64/cpuid.c:9:
elfloader-tool/include/arch-arm/cpuid.h:15:8: note:
previous declaration of 'read_cpuid_mpidr' was here
   15 | word_t read_cpuid_mpidr(void);
      |        ^~~~~~~~~~~~~~~~
[190/200] Building C object elfloader-tool/CMakeFiles/elfloader.dir/
src/arch-arm/smp_boot.c.obj
ninja: build stopped: subcommand failed.

Signed-off-by: Matthias Rosenfelder <[email protected]>
pagetables.

The 64 kiB alignment is a maximum requirement for a stage2
concatenated pagetable. See Table G5-4 in ARM DDI 0487I.a, page
G5-9186.

Note: Both comments at the top of the file as well as in line 85
say "64 kiB". 2^14 is unfortunately only 16 kiB.

Note2: This code is not executed on AArch64, because the
finish_relocation() function panics on AArch64. The latter always
takes the "shortcut" via continue_boot().

Signed-off-by: Matthias Rosenfelder <[email protected]>
shadowing.

The variable "dtb_size" is of type "size_t" and is defined in
src/arch-arm/sys_boot.c, line 36.

"size_t" is most certainly NOT the same size as "uint32_t", even on
32-bit architectures. Thus, the declaration in smp_boot.c is incorrect,
since it does not match the definition in sys_boot.c. Why even create
a local declaration - put this in a common header file and you will
see those problems right away. Single point of maintenance!

This may lead to an incorrectly sized memory access, that only happens
to be correct by chance in Little-Endian mode. For ARM in Big-Endian
mode this is a bug and will most likely result in an incorrect DTB
size of zero.

This fixes c573511 ("elfloader: pass DTB from bootloader to seL4 on
ARM").

Moreover, remove the shadowing of global variables by defining local
ones with the same name => Rename the local one in src/common.c.

This could have been detected with "-Wshadow".

Practically speaking our DTBs are always (a lot) smaller than 32-bit.
Thus, continue to pass a 32-bit size to the kernel in order to not
change the API here.

Signed-off-by: Matthias Rosenfelder <[email protected]>
The size calculation was incorrect, unfortunately. That lead to
an incorrect memory map provided by UEFI. When switching on EFI_DEBUG
one can see that the memory range occupied by the ELF-loader
("paddr=" line) is not fully marked as used by UEFI and the last
few pages of the ELF-loader are actually marked as being free.

Output before:

ELF-loader started on Image ranges:
  paddr=[7f9bc0000..7fcde1fff]
    text[7f9bc0000..7f9bd1e8f]
    data[7f9bd2000..7fcddd9ff]
     bss[7f9bd2850..7f9c0a4cf]
   edata[7fcddda00..7fcde1fff]
[...]
[paddr=0x180023000-0x7f9bbffff]
    [type = Conventional, attr: Normal  <- ok
[paddr=0x7f9bc0000-0x7fcdddfff]
    [type = Loader Code, attr: Normal   <- Not ok: end address too low
                                                Should be 0x4000 higher

[paddr=0x7fcdde000-0x7ffffffff]
    [type = Conventional, attr: Normal  <- Not ok: start address too low
                                                Should be 0x4000 higher

After:

ELF-loader started on Image ranges:
  paddr=[7f9bbc000..7fcdddfff]
    text[7f9bbc000..7f9bcde8f]
    data[7f9bce000..7fcdd99ff]
     bss[7f9bce850..7f9c064cf]
   edata[7fcdd9a00..7fcdddfff]
[...]
[paddr=0x180023000-0x7f9bbbfff]
    [type = Conventional, attr: Normal  <- ok
[paddr=0x7f9bbc000-0x7fcdddfff]
    [type = Loader Code, attr: Normal   <- ok (same as above)
[paddr=0x7fcdde000-0x7ffffffff]
    [type = Conventional, attr: Normal  <- ok (starts one after
                                               paddr end)

Note: You don't have that debug output (EFI_DEBUG) in your code,
that prints the UEFI memory map. So you have to believe me that
this is the actual output.

This fixes 030d83b ("elfloader: improve EFI support").

Signed-off-by: Matthias Rosenfelder <[email protected]>
The driver list section was part of the "*(COMMON)" section that
was placed *after* the image payload (kernel, rootserver etc.).
The driver list is data and should be placed adjacent to other
data belonging to the ELFloader.

This is clearly visible in the mapfile (which you don't generate).

The driver list entry is present in the aarch32 linker script for
EFI, why was it missing for 64 bit?

No functional change.

Signed-off-by: Matthias Rosenfelder <[email protected]>
Use existing defines to make the code more descriptive. For this
move some defines out of the assembler-only file.

This is a preparation patch for upcoming patches.

No functional change.

Signed-off-by: Matthias Rosenfelder <[email protected]>
Regarding right shifts the standard says:

"The type of the result is that of the promoted left operand.
The behavior is undefined if the right operand is negative, or
greater than or equal to the length in bits of the promoted left
operand."

Corresponding GCC warning (if used on a "small" type like uint8_t):

main.c:25:39: warning: right shift count >= width of type
[-Wshift-count-overflow]

\#define GET_PGD_INDEX(x)        (((x) >> (ARM_2MB_BLOCK_BITS +
PMD_BITS + PUD_BITS)) & MASK(PGD_BITS))

main.c:46:39: note: in expansion of macro ‘GET_PGD_INDEX’
   46 |     printf("GET_PGD_INDEX(x): %lu\n", GET_PGD_INDEX(x));
      |                                       ^~~~~~~~~~~~~

Thus, make sure that we never exceed/reach it by explicitly casting
to a 64-bit type.

It also allows using a pointer as macro parameter.

This is a preparation patch for upcoming patches.

Signed-off-by: Matthias Rosenfelder <[email protected]>
UEFI is an operating system that hides as a bootloader. UEFI is in
control of the machine as long as we didn't call exit_boot_services.

For instance, UEFI may set up timers to interrupt us while we're
fiddling with hardware and UEFI is fiddling with hardware itself and
UEFI may be fiddling with the exact same hardware that we are
fiddling with, while we're being preempted. That is not good.

The previous state of ELFloader is that before exiting UEFI boot
services, we already called platform_init() in main(), which may
fiddle around with all kinds of hardware. Thus, we should have
already exited UEFI boot services when main() is called.

Note that exit_boot_services now still executes on the UEFI stack
(since we switch the stack in _start()). But so did e.g. the
clear_bss() function. I don't see a problem here.

It's more a question the other way around: Previously, we called
into UEFI with exit_boot_services on our own, potentially too small,
stack. Do we have enough space for UEFI to execute? How are we
supposed to know that? The UEFI implementation can change, so we
can never be sure. But it would be unreasonable for UEFI to start
us with a stack that is too small to call any UEFI API, including
exit_boot_services. So we can safely assume that there is enough
space when using the UEFI stack (since our use of stack to this
point is minimal).

Also, mask all exceptions until we are about to enter the kernel.
We do not want to run with whatever state the bootloader set us up
before, do we? We only re-enable the asyncs and debugs; interrupts
and FIQs are still masked when entering the kernel. What would we
gain from that? We don't expect any. Asyncs (SErrors), however,
can indicate that we e.g. touched memory that we shouldn't have
touched (secure memory).

Signed-off-by: Matthias Rosenfelder <[email protected]>
There are better ways to loop in C.

No functional change.

Signed-off-by: Matthias Rosenfelder <[email protected]>
size mismatch.

The UEFI specification 2.10 says in section 7 for
EFI_BOOT_SERVICES.GetMemoryMap():

"The GetMemoryMap() function also returns the size and revision
number of the EFI_MEMORY_DESCRIPTOR. The DescriptorSize represents
the size in bytes of an EFI_MEMORY_DESCRIPTOR array element returned
in MemoryMap. The size is returned to allow for future expansion of
the EFI_MEMORY_DESCRIPTOR in response to hardware innovation.

The structure of the EFI_MEMORY_DESCRIPTOR may be extended in the
future but it will remain backwards compatible with the current
definition. Thus OS software must use the DescriptorSize to find
the start of each EFI_MEMORY_DESCRIPTOR in the MemoryMap array."

This mismatch is the case on (our) Orin UEFI. The compiled size of
a memory descriptor is 40 Bytes, but the Orin UEFI implementation
uses 48 Bytes per descriptor.

Thus, due to the requirement to use a larger size than the
returned total size (due to the fact that the buffer allocation
itself may lead to one more entry in the memory map), we must
increase by the size (in terms of number of descriptors), but
use the number of bytes that UEFI uses for one memory map entry,
not what we think it might be.

Some other people already stumbled over this:
https://forum.osdev.org/viewtopic.php?f=1&t=32953

Based on the comment in the existing code, the author seems to
not have understood how the size of the memory map can be
determined. Just read the spec!
So we better update that misleading comment.

Signed-off-by: Matthias Rosenfelder <[email protected]>
@kent-mcleod
Copy link
Member

As mentioned in #190, this is the series of commits that do general touch-ups in the elfloader.

This PR seems to not contain one of the touch up commits: 0108b77

And contains an additional commit that is not a touch up commit and changes behavior: 7e410f0

Is this intentional?

@andybui01
Copy link
Author

As mentioned in #190, this is the series of commits that do general touch-ups in the elfloader.

This PR seems to not contain one of the touch up commits: 0108b77

And contains an additional commit that is not a touch up commit and changes behavior: 7e410f0

Is this intentional?

Yeah, the first one makes less sense to me without the other commits that all attempt to solve a problem (stabilize aarch64 booting for the Orin).

As for the second, I think its impact is low enough to be included here? It doesn't seem to be that big of a behavioral change.

@Ivan-Velickovic Ivan-Velickovic added the hw-test enable sel4test hardware builds + runs label Apr 5, 2024
@andybui01
Copy link
Author

@axel-h A re-review would be appreciated if you can spare some time. Also the failing HW tests are passing when I check the log, as they reach the end but the message "All is well in the universe" becomes truncated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hw-test enable sel4test hardware builds + runs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants