
Unable to get libcluster to connect over Kubernetes :pods :ip #163

Closed
Sieabah opened this issue Aug 13, 2021 · 0 comments
Sieabah commented Aug 13, 2021

Steps to reproduce

  • Configuration used (libcluster ~> 3.3):

```elixir
config :libcluster,
  topologies: [
    default_topology: [
      strategy: Elixir.Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        kubernetes_ip_lookup_mode: :pods,
        kubernetes_node_basename: "myapp",
        kubernetes_selector: "appGroup=myapp",
        kubernetes_namespace: "default",
        polling_interval: 10_000
      ]
    ]
  ]
```
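With `kubernetes_ip_lookup_mode: :pods`, the strategy queries the Kubernetes API to list pods, so the pod's service account needs RBAC permission for that. A minimal sketch of what that might look like (resource names here are illustrative, not from the issue):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader          # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]
    resources: ["pods", "endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding  # hypothetical name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: default           # or the deployment's service account
    namespace: default
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

If the API query itself fails, libcluster logs an error rather than "unable to connect", so missing RBAC is unlikely to be the cause here, but it is worth ruling out.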

It uses a shared Service specifically to group the pods and make them discoverable:

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    name: myapp
    appGroup: "myapp"
  name: myapp-discovery
spec:
  ports:
    - port: 4369
      targetPort: 4369
  selector:
    appGroup: "myapp"
  type: NodePort
```

  • Strategy used: `Elixir.Cluster.Strategy.Kubernetes`

  • Errors/incorrect behaviour encountered: "unable to connect" with no other logging; literally nothing else happens.

```elixir
def start(_type, _args) do
  topologies = Application.get_env(:libcluster, :topologies)

  children = [
    {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
    # Start the Telemetry supervisor
    MyAppWeb.Telemetry,
    # Start the PubSub system
    {Phoenix.PubSub, name: MyApp.PubSub},
    # Start the Endpoint (http/https)
    MyAppWeb.Endpoint,
    # Presence
    MyAppWeb.RoomPresence
  ]

  opts = [strategy: :one_for_one, name: MyApp.Supervisor]
  Supervisor.start_link(children, opts)
end
```
```shell
# env.sh.eex
export RELEASE_DISTRIBUTION=name

# Environment: https://hexdocs.pm/mix/Mix.Tasks.Release.html
export RELEASE_NODE=${BASENAME_GROUP}@${K8_POD_IP}
```

The requested endpoints show up properly when I run `k get endpoints`:

```
myapp-deployment       10.7.0.173:4369,10.7.0.227:4369,10.7.1.13:4369
```

I've tried 127.0.0.1, the pod ID, hardcoding the basename, building it from the environment, and using `release.name` from Elixir. Nothing has worked or produced any output other than "unable to connect". `K8_POD_IP` is retrieved from the Kubernetes `status.podIP` field, and it's valid. I've connected to the box and verified that every single environment variable is correct and that every sourced environment value works as expected.
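One way to probe this kind of failure is from a remote console on a running pod (e.g. `bin/myapp remote`). This is only a debugging sketch; the IP is one of the example addresses from the logs below:

```elixir
# Ask epmd on a peer pod which node names it has registered.
# Requires epmd on that host to be reachable on port 4369.
:net_adm.names(~c"10.7.0.227")
# => {:ok, [{~c"myapp", 12345}]} on success, {:error, reason} otherwise

# Try a direct connection: :pong means distribution works end to end,
# :pang means the name, cookie, or distribution port is the problem.
Node.ping(:"[email protected]")
```

A `:pang` here with a successful `:net_adm.names/1` usually points at a cookie mismatch or a blocked dynamic distribution port, rather than at libcluster itself.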

Description of issue

All my pod manages to echo out is that it cannot connect, and that's it. It's effectively a dead pod that starts no other processes or services and never pings or emits another connection error or warning. I have zero idea what it's doing or why it's failing. This is a built OTP release from a regular Phoenix/Elixir application.

```
04:42:21.386 [warn]  Description: 'Authenticity is not established by certificate path validation'
     Reason: 'Option {verify, verify_peer} and cacertfile/cacerts is missing'
04:42:28.520 [warn]  [libcluster:myapp] unable to connect to :"[email protected]"
04:42:35.521 [warn]  [libcluster:myapp] unable to connect to :"[email protected]"
```

I'm at a complete loss as to how to debug this, since there is zero output no matter what I do. The IPs differ between the logs and the endpoint list because of when I ran the commands; no, they are not incorrect, and no, that is not the issue. If I kill the pod and it restarts, it gives me the IPs from the endpoint list plus the IP of the pod I just terminated.

Edit:

I've done about 40-50 deployments trying to get some combination of these fields to work, and the only output I've managed to get, other than no logs at all, is when I use `vm.args.eex` with the following.

```
# vm.args.eex
-name ${BASENAME_GROUP}@${K8_POD_ID}
```

This gives me a nice, constant, endless stream of the following logs.

```
05:59:59.917 [warn]  [libcluster:myapp] unable to connect to :"[email protected]"
05:59:59.917 [error] ** System NOT running to use fully qualified hostnames **
** Hostname 10.7.1.13 is illegal **
```

So I went on to find this post, and switched everything over to this config instead.

```shell
# env.sh.eex
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE="<%= @release.name %>@$(echo "$K8_POD_IP" | sed 's/\./-/g').${SERVICE_NAME}.${K8_NAMESPACE}.svc.cluster.local"
```
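The `sed` call in that script rewrites the dots in the pod IP into dashes, which is the form Kubernetes uses for pod DNS names (`a-b-c-d.<service>.<namespace>.svc.cluster.local`). A standalone sketch of just that transform, with an illustrative IP:

```shell
K8_POD_IP="10.7.0.173"  # example value; in the pod this comes from status.podIP
# Replace every dot with a dash to form the pod's DNS label
echo "$K8_POD_IP" | sed 's/\./-/g'
# prints: 10-7-0-173
```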
```elixir
# config
config :libcluster,
  topologies: [
    myapp: [
      strategy: Elixir.Cluster.Strategy.Kubernetes.DNSSRV,
      config: [
        service: System.get_env("SERVICE_NAME", nil),
        application_name: System.get_env("BASENAME", "myapp"),
        polling_interval: 10_000
      ]
    ]
  ]
```

Why is there such a horrific lack of documentation for this process, or even a basic example that actually works? Why are the configuration options mixed into both code blocks and paragraphs? I'm having to find my solution in GitHub issues rather than in the docs.

Surprise surprise, the docs are wrong: you end up with a `KeyError` for `namespace`, which is included in the comment but not in the docs. More specifically: `'Elixir.KeyError', key => namespace`.
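Given that error, the DNSSRV config above presumably needs the missing key added, along these lines (a sketch only; the env-var fallback and default value are assumptions, not from the docs):

```elixir
config :libcluster,
  topologies: [
    myapp: [
      strategy: Elixir.Cluster.Strategy.Kubernetes.DNSSRV,
      config: [
        service: System.get_env("SERVICE_NAME", nil),
        application_name: System.get_env("BASENAME", "myapp"),
        # The key the KeyError complains about; "default" is an assumed fallback
        namespace: System.get_env("K8_NAMESPACE", "default"),
        polling_interval: 10_000
      ]
    ]
  ]
```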


I'm not sure if this is just an overlooked detail, but I've never had a more frustrating experience trying to get something to work than with this package. Judging by the sheer volume of workarounds I've found in GitHub issues across a variety of projects, it seems to be a very common problem. So I hope that by writing this out it'll get picked up in a search and save someone the hours of elevated blood pressure I've been through.

Unsurprisingly, after adding the namespace I'm back to my original issue: it logs twice that it cannot connect, then goes completely dead and stops logging. I'm going to stop trying to debug this for now. Hopefully someone, anyone, knows what I'm doing wrong.

```
06:49:14.669 [warn]  [libcluster:myapp] unable to connect to :"[email protected]"
06:49:14.704 [warn]  [libcluster:myapp] unable to connect to :"[email protected]"
```

Edit 2: After more time passed, I deleted the deployments and reset every single component. Magically, all of a sudden the issue has gone away.

@Sieabah Sieabah changed the title Unable to get libcluster to connect over Kubernetes Unable to get libcluster to connect over Kubernetes :pods :ip Aug 13, 2021
@Sieabah Sieabah closed this as completed Aug 13, 2021
cdesch added a commit to cdesch/libcluster that referenced this issue Nov 21, 2021
Add missing `namespace` to Documentation Example for Strategy.Kubernetes.DNSSRV 
Reference Issue [163](bitwalker#163) regarding docs.
bitwalker pushed a commit that referenced this issue Jan 18, 2022
Add missing `namespace` to Documentation Example for Strategy.Kubernetes.DNSSRV 
Reference Issue [163](#163) regarding docs.