
Orion-LD (1.4.0) crashes with double free detected in tcache 2 #1499

Closed
FR-ADDIX opened this issue Dec 13, 2023 · 48 comments

@FR-ADDIX

ORION-LD V 1.4.0
Kubernetes v1.26.9

time=Wednesday 13 Dec 10:10:22 2023.838Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=mongocInit.cpp[58]:mongocLog | msg=MONGOC[mongoc]: Falling back to malloc for counters.
free(): double free detected in tcache 2
Stream closed EOF for fiware/orion-ld-deployment-67cf4d9559-lscl2 (orion-ld)
@kzangeli
Collaborator

"tcache 2" ?
Not a clue what that might be ...
I'm going to need some more info here to have a chance to look into this issue.
The usual stuff: how the broker was started, what requests are sent to the broker ...

@ibordach

This is very difficult to reproduce. The double free crash always appears together with the "Falling back to malloc for counters" warning from mongoc.

    spec:
      containers:
      - args:
        - -dbhost
        - mongodb-headless.fiware.svc.cluster.local:27017/
        - -logLevel
        - DEBUG
        - -ctxTimeout
        - "10000"
        - -mongocOnly
        - -lmtmp
        - -reqPoolSize
        - "120"
        - -logSummary
        - "30"
        env:
        - name: ORIONLD_CORE_CONTEXT
          value: v1.0
        - name: ORIONLD_MONGO_REPLICA_SET
          value: rs0
        - name: ORIONLD_MONGO_USER
          value: root
        - name: ORIONLD_MONGO_TIMEOUT
          value: "5000"
        - name: ORIONLD_MONGO_POOL_SIZE
          value: "180"
        - name: ORIONLD_MONGO_ID_INDEX
          value: "TRUE"
        - name: ORIONLD_STAT_COUNTERS
          value: "TRUE"
        - name: ORIONLD_STAT_SEM_WAIT
          value: "TRUE"
        - name: ORIONLD_STAT_TIMING
          value: "TRUE"
        - name: ORIONLD_STAT_NOTIF_QUEUE
          value: "TRUE"
        - name: ORIONLD_SUBCACHE_IVAL
          value: "10"
        - name: ORIONLD_DISTRIBUTED
          value: "TRUE"
        - name: ORIONLD_CONTEXT_DOWNLOAD_TIMEOUT
          value: "10000"
        - name: ORIONLD_NOTIF_MODE
          value: threadpool:252:4
        - name: ORIONLD_DEBUG_CURL
          value: "TRUE"

@ibordach

It looks like a race condition under "high" load.

@ibordach

We have currently reduced our setup from three orion-ld instances to one, with no subscriptions. That orion receives roughly 2000 messages per second and keeps dying with this double free error. The same happens with 3 orions or 6 orions.

@FR-ADDIX
Author

Are you using C++ and DynamicArray somewhere?
"free(): double free detected in tcache 2" seems to be an error message if more elements are deleted from the array than are present.

@kzangeli
Collaborator

kzangeli commented Dec 17, 2023

Are you using C++ and DynamicArray somewhere?

Not a fan of C++. I'm a C guy ... I'm replacing most of the old C++ code of Orion with pure C in Orion-LD.

Now, if you could start the broker inside GDB or Valgrind, we'd get a stack trace.
That would be extremely helpful to understand where the problem comes from.
Either that, or a way for me to reproduce the crash.
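For example, something along these lines (the binary path and broker options here are only illustrative; adjust them to how the container actually starts the broker):

gdb --args /usr/bin/orionld -dbhost mongodb-headless.fiware.svc.cluster.local:27017/ -logLevel DEBUG -mongocOnly
(gdb) run
... reproduce the load until the crash ...
(gdb) bt

Or, if you already have a core file:

gdb /usr/bin/orionld /path/to/core
(gdb) bt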

@ibordach

ok, we have a chance to get the core files. We also started a pod with a gdb-enabled version of orion-ld, but we have no debug info for gdb. How can we generate more information for you? I'm not a C guy. ;-)

Orion: 1.4.0-PRE-1425-GDB

(gdb) bt
#0  0x00000000004bb2a4 in ?? ()
#1  0x00000000004974ad in ?? ()
#2  0x00000000007fa132 in ?? ()
#3  0x00000000007fb870 in ?? ()
#4  0x0000000000803454 in ?? ()
#5  0x0000000000804105 in ?? ()
#6  0x00007fc6a18541cf in start_thread () from /lib64/libpthread.so.0
#7  0x00007fc69cd36dd3 in clone () from /lib64/libc.so.6
(gdb)

@ibordach

sorry for the image
[image: gdb backtrace screenshot]

@kzangeli
Collaborator

ok, getting closer.
mongoc_collection_destroy called twice.
That's already a good hint.
However, everything in between is symbol-less, so it could be better.
I think we already create debug images on dockerhub.
Would just be a question of using one of those instead (with debug symbols and not stripped).
This one for example: 1.5.0-PRE-1551-debug (or just any image whose name ends in "-debug")

Now, when you go from RELEASE (-O2) compilation to DEBUG (-g) compilation, the executable changes. Quite a lot.
Often, bugs disappear or move elsewhere.
Let's see if we're lucky!

@ibordach

now with 1.5.0-PRE-1551-debug:

(gdb) bt
#0  0x00007f4cb1ed0ee1 in abort () from /lib64/libc.so.6
#1  0x00007f4cb1f40037 in __libc_message () from /lib64/libc.so.6
#2  0x00007f4cb1f4719c in malloc_printerr () from /lib64/libc.so.6
#3  0x00007f4cb1f48f0d in _int_free () from /lib64/libc.so.6
#4  0x00007f4cb2e1623e in mongoc_collection_destroy (collection=0x7f4c3000e250) at /cmake-3.14.5/mongo-c-driver-1.22.0/src/libmongoc/src/mongoc/mongoc-collection.c:246
#5  0x000000000048fa0d in requestCompleted(void*, MHD_Connection*, void**, MHD_RequestTerminationCode) ()
#6  0x0000000000807622 in MHD_connection_close_ (connection=connection@entry=0x7f4c30000d60, termination_code=<optimized out>) at connection.c:842
#7  0x0000000000808949 in connection_reset (reuse=false, connection=0x7f4c30000d60) at connection.c:4328
#8  MHD_connection_handle_idle (connection=connection@entry=0x7f4c30000d60) at connection.c:4832
#9  0x000000000080bbb9 in call_handlers (con=con@entry=0x7f4c30000d60, read_ready=<optimized out>, write_ready=<optimized out>, force_close=<optimized out>) at daemon.c:1234
#10 0x0000000000810a48 in MHD_epoll (daemon=0x2245f90, millisec=<optimized out>) at daemon.c:5149
#11 0x0000000000811695 in MHD_polling_thread (cls=0x2245f90) at daemon.c:5386
#12 0x00007f4cb6a011cf in start_thread () from /lib64/libpthread.so.0
#13 0x00007f4cb1ee8dd3 in clone () from /lib64/libc.so.6
(gdb)

@kzangeli
Collaborator

ok!
Here's the code snippet:

  //
  // Release the connection to mongo - after all notifications have been sent (orionldAlterationsTreat takes care of that)
  //
  if (orionldState.mongoc.contextsP)       mongoc_collection_destroy(orionldState.mongoc.contextsP);
  if (orionldState.mongoc.entitiesP)       mongoc_collection_destroy(orionldState.mongoc.entitiesP);
  if (orionldState.mongoc.subscriptionsP)  mongoc_collection_destroy(orionldState.mongoc.subscriptionsP);
  if (orionldState.mongoc.registrationsP)  mongoc_collection_destroy(orionldState.mongoc.registrationsP);

I'll look into that. See if I can find the error.
Would be so much easier if I could reproduce the error myself, but ...
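Since the backtrace shows the double free happening inside requestCompleted, one defensive variant of the snippet above (just a sketch, not necessarily the change that was eventually merged) is to clear each pointer right after destroying it, so a repeated cleanup pass becomes a no-op:

  // Sketch: NULL each pointer after destroying it, so a repeated callback cannot free it twice
  if (orionldState.mongoc.entitiesP != NULL)
  {
    mongoc_collection_destroy(orionldState.mongoc.entitiesP);
    orionldState.mongoc.entitiesP = NULL;
  }
  // ... and likewise for contextsP, subscriptionsP, and registrationsP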

@FR-ADDIX
Author

Can you run multiple Orion-LD processes and point them at a MongoDB replica set?
We have 3 Orion-LD instances writing against a MongoDB with 2 replicas.
We currently have around 26,000 entities in the database, so perhaps that is another factor that causes the error to be triggered more quickly.
If we write around 100 msg/sec against the CB, it stays stable for longer; if we increase the load to 1500 msg/sec, the crash can be reproduced relatively quickly.

@kzangeli
Collaborator

ok, I can do that.
Good input, thanks

@DasNordlicht

We have now significantly reduced the number of entities and are currently tracking around 2000 vehicles.
This has made Orion-LD more stable.

However, this is only a temporary workaround, not a solution.
The aim is to reach well over 100,000 entities.

@kzangeli
Collaborator

Yeah, of course not. That's not the solution

@ibordach

ibordach commented Feb 8, 2024

Are there any ideas on how we can help with this issue? We see the same crashes with 1.6.0-PRE-1561.
This is already causing us real pain.

@kzangeli
Collaborator

kzangeli commented Feb 8, 2024

Yeah, so sorry, this issue is long overdue.
I can only say I'm sorry.
Our development team is tiny, to say the least, and these last few months I've been busy elsewhere, with various European projects and ETSI ISG CIM ...

That said, I took a quick look at this this morning and I've come up with a "desperate" attempt.
Let's start there (I'll let you know when the docker image is ready) and hopefully I'll have some more time in the near future.

kzangeli added a commit that referenced this issue Feb 8, 2024
kzangeli added a commit that referenced this issue Feb 8, 2024
@kzangeli
Collaborator

kzangeli commented Feb 8, 2024

The PR #1548 has been merged into develop and the docker image should reach dockerhub in 15-20 minutes. It's usually that fast.

Now, it's not that I've found the issue and fixed it.
We'll see if the modification changes anything.

@ibordach

ibordach commented Feb 8, 2024

:-) We will test it as soon as it is available.

@kzangeli
Collaborator

kzangeli commented Feb 8, 2024

1.6.0-PRE-1574

@ibordach

ibordach commented Feb 8, 2024

up and running :-), we will check it

@kzangeli
Collaborator

kzangeli commented Feb 8, 2024

So, this "fix" is about MHD (3rd party lib for HTTP) calls the "request completed" callback more than once for one and the same connection. That would be one way for the error to appear. I kind of doubt it though.

Another possibility is that mongoc_client_get_collection returns the same mongoc_collection_t twice.
Also hard to believe, but I just saw that 'mongo_client_t` is not thread safe, and a client pointer is input to that function.
So, a semaphore there, around getting collections seems like a good idea.
I'll implement that now and we'll test again once the first test fails :)
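A minimal sketch of that idea, serializing collection lookups on the shared client (the helper name and the use of a pthread mutex here are only illustrative):

#include <pthread.h>
#include <mongoc/mongoc.h>

// Serialize collection lookups on a shared, non-thread-safe mongoc_client_t
static pthread_mutex_t mongocClientMutex = PTHREAD_MUTEX_INITIALIZER;

mongoc_collection_t* collectionGet(mongoc_client_t* clientP, const char* dbName, const char* collectionName)
{
  pthread_mutex_lock(&mongocClientMutex);
  mongoc_collection_t* collectionP = mongoc_client_get_collection(clientP, dbName, collectionName);
  pthread_mutex_unlock(&mongocClientMutex);

  return collectionP;
}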

kzangeli added a commit that referenced this issue Feb 8, 2024
One more fix for #1499 - protected mongocConnectionGet with a semaphore
@kzangeli
Collaborator

kzangeli commented Feb 8, 2024

So, I merged that second fix as well: a semaphore around getting the connection to mongo.
The docker image as usual in 15 min.

I'm far from convinced that any of this will fix the issue you're seeing, but ... it might ;)

@ibordach

ibordach commented Feb 9, 2024

crash with 1574, now testing 1575

@ibordach

ibordach commented Feb 9, 2024

A first bit of cautious feedback after roughly two and a half hours:
so far we have not seen this type of error, and are only running into issue #1441

@kzangeli
Collaborator

kzangeli commented Feb 9, 2024

ok, after a week or so we might assume the semaphore fixed the problem.

While looking at this yesterday, I found an alternative to using that "thread unsafe" mongoc_client_t.
If this semaphore fixed the problem, then surely mongoc_client_t is the culprit.
I'll try to replace all of those with the alternative and remove the semaphore, as it is not really needed (only the usage of mongoc_client_t makes it necessary).
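For reference, the documented mongoc pattern for multi-threaded use is a mongoc_client_pool_t, where every thread pops its own client and pushes it back when done. A minimal sketch (URI, database and collection names are only illustrative):

#include <mongoc/mongoc.h>

static mongoc_client_pool_t* poolP;  // one pool per process, safe to share across threads

void mongocPoolInit(void)
{
  mongoc_init();

  mongoc_uri_t* uriP = mongoc_uri_new("mongodb://mongodb-headless.fiware.svc.cluster.local:27017/?replicaSet=rs0");

  poolP = mongoc_client_pool_new(uriP);
  mongoc_uri_destroy(uriP);
}

void requestTreat(void)
{
  mongoc_client_t*     clientP   = mongoc_client_pool_pop(poolP);                             // thread-safe checkout
  mongoc_collection_t* entitiesP = mongoc_client_get_collection(clientP, "orion", "entities");

  // ... mongo operations for this request ...

  mongoc_collection_destroy(entitiesP);
  mongoc_client_pool_push(poolP, clientP);                                                    // return the client to the pool
}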

@kzangeli
Collaborator

So, still no more crashes of that type?

I'll update #1441 right now.

@kzangeli kzangeli self-assigned this Feb 27, 2024
@ibordach

We only notice the FD_SETSIZE error since your semaphore fix. :-)

@kzangeli
Collaborator

We only notice the FD_SETSIZE error since your semaphore fix. :-)
That is interesting indeed.
The extra semaphore slightly changes the timing of the broker. It makes it a bit slower.
Of course, it also means that the change to mongoc 1.24 won't make any difference, but that was just a shot in the dark anyway.

Now, with this piece of information it seems almost clear to me that you suffer from a lack of resources.
It really seems that way.
In the end it's the broker's fault; it should manage its resources, throttle incoming requests, and so on.
I wish I had the resources to implement all that! But I don't, unfortunately.

My suggestion, then, is that you add resources to both the broker and mongo, and raise the max FD size to a million.
See how things turn out.
It doesn't seem like a leak of FDs; it seems like the broker is simply overloaded.

@ibordach

ibordach commented Feb 28, 2024

I don't think the broker is overloaded. It consumes about 15% of a CPU core and about 100MB of RAM. To address the FD problem we started 6 orions. The mongodb also has no resource issues: it consumes 50% of a CPU core and about 2GB of RAM, and there are plenty of cores and RAM available.

The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.

@ibordach

The mongodb has about 2 * 300 open connections.

@ibordach

ibordach commented Feb 28, 2024

We manually restarted the orions in the morning and reduced the setup to 5 instances:
[image: monitoring graphs]

@kzangeli
Collaborator

The semaphore fix is not the starting point of the FD_SETSIZE error. We had that error before.

So, this comment wasn't really correct then :)

We only notice the FD_SETSIZE error since your semaphore fix. :-)

@ibordach

Yes, it is the last error left. :-)

@kzangeli
Collaborator

Now, if mongo has 600 connections but you get FDs over 1565, we need to find those other almost 1000 FDs ...
Any info you find will help me to understand where the error is.

That said, I'd up the FD MAX to a LARGE number and hope it's just normal.
Slow mongo, for example.
Frankly, I have no clue what is causing this problem.
Very interested in fixing it though
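One quick way to see what those FDs actually are from inside the pod (assuming a shell and procfs are available, and that the process is called orionld):

ls /proc/$(pidof orionld)/fd | wc -l
ls -l /proc/$(pidof orionld)/fd | awk '{print $NF}' | sed 's/\[[0-9]*\]//' | sort | uniq -c | sort -rn

The first line counts the open FDs; the second groups their targets (socket:, pipe:, regular files) so you can see which kind is actually growing.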

@kzangeli
Collaborator

Question no 1 is:

  • Is it really a leak of file descriptors?

And question no 2, if (1) is YES:

  • What fd is it?

We already know it's a socket, but, of course, pretty much all FDs are sockets ...

@ibordach

We just get tons of this within a second, and after that a crash (i.e. a restart):

New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).

#1441

@kzangeli
Collaborator

Yeah, we need more input.
We already removed the Prometheus thingy.
We also tried without notifications (zero subscriptions).
You don't use registrations (right?)
The next step would be the new mongoc driver, and I'm on that; it just doesn't look too promising as a fix for this.

I'll try to find something more to "rule out".

Not easy, but we'll get there.

@kzangeli
Collaborator

We finally solved the base image issue (not 100% but almost).
There's a new image on dockerhub:
1.6.0-PRE-1587

That one uses the newer version 1.24.2 of mongoc that actually supports mongodb 7.0.
So, let's try again and see what happens ...

It would also be nice to test with a less recent release of the mongo server: 4.4, 5.0, 6.0 ...
Not that we're getting a bit desperate ... :)

@ibordach

We have a cluster with mongodb 6 and one with mongodb 7. Both are now running 1.6.0-PRE-1587. We will see. :-)

@ibordach

The bandwidth of one orion-ld instance:
[image: bandwidth graph]
At 10:15 we started some more orion instances, so the traffic per instance is somewhat lower.

The mongodb replica set is under no stress. The number of connections increases slowly over time, and bandwidth usage is relaxed:
[image: mongodb connections and bandwidth graphs]

@kzangeli
Collaborator

So, not looking good then :(
If only we knew which FD is growing!

@ibordach

The instance consumes a little bit of RAM and then dies:
[image: memory usage graph]

@kzangeli
Collaborator

Different crash then?
Not too many open file descriptors?

@ibordach

Same error FD_SETSIZE

@kzangeli
Collaborator

I did a quick search on the error (should have done that ages ago ...).
The "article" is old, I know, but it might still be interesting.
Have a look at it and tell me what you think:
https://www.mongodb.com/community/forums/t/mongodb-4-2-19-extremely-high-file-descriptor-counts-70k/170223

@kzangeli
Collaborator

Ah, we're commenting in the wrong issue.
Mea culpa.

We should be in #1441.
I'll move my last comment there.
This issue is about the crash that was presumably fixed with the semaphore I added.
Can this issue be closed? (and we continue the discussion in #1441?)

@ibordach

Yes, this can be closed.

@kzangeli kzangeli closed this as completed Mar 1, 2024