Route resolution failure on Client + Segfault on Mellanox NIC #48
Comments
Hi. Thanks for reporting this issue. Can you confirm if …
On the server I started with ib_read_bw, and on the client with ib_read_bw with …. Going by the output, I believe it works. Nonetheless I'm attaching the traces, as I am just starting out with RDMA applications. The server-side trace:
```
 Dual-port       : OFF          Device         : mlx5_0
 local address: LID 0000 QPN 0x02c9 PSN 0x2df5d7 OUT 0x10 RKey 0x003572 VAddr 0x007f65b20e9000
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
```

On the client:

```
 Dual-port       : OFF          Device         : mlx5_0
 local address: LID 0000 QPN 0x0239 PSN 0x2d5540 OUT 0x10 RKey 0x003582 VAddr 0x007f1a3666e000
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
```
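As a side note, the routing-relevant fields in those traces can be read mechanically. The snippet below is purely illustrative (the input line is copied from the server trace above); the key observation is that a LID of 0000 means the port has no InfiniBand LID assigned, i.e. the link layer is Ethernet (RoCE), so all addressing must go through GIDs:

```python
import re

# One "local address" line copied from the server-side ib_read_bw trace.
line = ("local address: LID 0000 QPN 0x02c9 PSN 0x2df5d7 "
        "OUT 0x10 RKey 0x003572 VAddr 0x007f65b20e9000")

# Pull out the fields that matter for routing: LID, QPN, and PSN.
m = re.search(r"LID (\w+) QPN (0x[0-9a-f]+) PSN (0x[0-9a-f]+)", line)
lid, qpn, psn = m.group(1), m.group(2), m.group(3)
print(lid, qpn, psn)  # prints: 0000 0x02c9 0x2df5d7
# LID 0000 => no IB LID assigned => Ethernet link layer (RoCE);
# the endpoints must therefore be addressed by GID, not LID.
```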
Thanks for the details. Could you please test whether eRPC works on your setup with older Mellanox drivers (e.g., Mellanox OFED 4.4)? There have been a lot of recent NIC driver changes and I haven't kept the code up to date. I am aware that eRPC no longer builds with the Raw transport on new Mellanox OFED versions (or rdma-core) because the ibverbs API has changed. I plan to fix this eventually, but I'm not sure when I'll have the time.
Hi Dr. Kalia. I use a 4-node cluster as servers and a 2-node cluster as clients. On the client side I have …

At first I thought it was due to invalid LIDs (…). Could you give any advice for further troubleshooting?
Hi! The verbs address-handle creation process is a bit complex, so it's likely I missed something in my implementation of …

Pass -DROCE=on if you're using RoCE.

My suggestion to fix this would be to see how the …
Hi Dr. Kalia. Thanks to your precise analysis, I was able to find that the …

The kDefaultGIDIndex works most of the time, but unluckily my two clusters have different NIC configurations. The default value therefore picks the wrong GID in one cluster when RoCE is enabled, so the two clusters fail to communicate.

The reason …
Hi,

- OFED version: 5.0.2
- NIC: Mellanox ConnectX-5
- OS: Ubuntu 18
- A Tofino switch does simple packet forwarding for us.
If I build with -DTransport=infiniband and -DROCE=on, the build succeeds, but when I run the hello world app I get the following error on the client:
```
Received connect response from [H: 192.168.1.4:31850, R: 0, S: XX] for session 0. Issue: Error [Routing resolution failure]
```
The server receives the initial connect packet from the client; then the client segfaults, and the server prints the statement below in a loop:

```
Received connect request from [H: 192.168.1.3:31850, R: 0, S: 0]. Issue: Unable to resolve routing info [LID: 0, QPN: 449, GID interface ID 16601820732604482458, GID subnet prefix 33022]. Sending response.
```
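The two big integers in that log line are the halves of the client's 128-bit GID and can be decoded. This is an illustrative script, assuming the log prints each 64-bit half as a little-endian host integer; the assumption is supported by the result, which is a well-formed EUI-64 link-local GID:

```python
import ipaddress
import struct

def decode_gid(subnet_prefix: int, interface_id: int) -> str:
    # Reassemble the 16 GID bytes from the two 64-bit halves printed in
    # the log (assumption: each half is a little-endian u64).
    raw = struct.pack("<QQ", subnet_prefix, interface_id)
    return str(ipaddress.IPv6Address(raw))

# Values copied from the server's "Unable to resolve routing info" log.
print(decode_gid(33022, 16601820732604482458))
# prints: fe80::9a03:9bff:fe83:65e6
```

The decoded GID is link-local (fe80::/64), i.e. the IB-style / RoCE v1-style entry in the GID table, rather than the IPv4-mapped GID that a RoCE v2 setup typically addresses with. That is consistent with the earlier comment about the default GID index selecting the wrong table entry.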
In the README, you mentioned using -Dtransport=raw for Mellanox NICs, but I was not able to build with that flag (error trace attached). We want to use eRPC over RoCEv2 + DCQCN; we are okay with IB unless you tell us otherwise. The RDMA devices are up according to `rdma link` and `ibdev2netdev`.
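For reference, a rough sketch of the two build configurations discussed in this thread. The flag spellings are taken verbatim from the comments above and may differ across eRPC versions; check the eRPC README for the exact names:

```shell
# RoCE / InfiniBand transport build, as described above
# (hypothetical sketch; flag names taken from this thread):
cmake . -DTransport=infiniband -DROCE=on && make -j

# Raw-transport build for Mellanox NICs, which reportedly fails on
# newer Mellanox OFED / rdma-core because the ibverbs API changed:
cmake . -Dtransport=raw && make -j
```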