
Overview

The perception pipeline is entirely camera-based. The camera used is the Intel RealSense D435.

The module is composed of an object detection neural network operating on each individual color image. The midpoint of each bounding box is then reprojected to the car frame using the camera model: the intrinsic parameters are provided automatically by the RealSense, while the extrinsic parameters are defined by the team. The origin of the car frame is located at the vehicle's center of gravity.

TODO: add diagram

Cone detection

To detect the cones, the DAMO-YOLO model is used. DAMO-YOLO was introduced by Alibaba Group in 2022 with the goal of proposing a state-of-the-art object detection model for real-time edge computing applications. The implementation can be found here.

The model is pre-trained on the COCO dataset. However, for Formula Student teams, the most relevant dataset is the FSOCO dataset. Our team adopts transfer learning: the weights of all layers except the last ones are preserved, the last layers are replaced to match the number of classes present in the FSOCO dataset, and fine-tuning is then performed.

TODO: add the graph from paperswithcode

TODO: add reference to the paper

The provided implementation includes utilities to convert the model to the ONNX or TensorRT formats, which is valuable for integrating it into a production C++ codebase. However, in our case we use neither ONNX nor TensorRT, due to their higher implementation complexity.

Instead, the model is integrated into the C++ code by exporting it to the TorchScript format (.pth extension). LibTorch is used to load the model and run inference. LibTorch was chosen over TensorRT and ONNX because it minimizes implementation effort: its API is the simplest of the three and provides a smooth integration.
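As a minimal sketch of this deployment path (the file name, input resolution and tensor layout below are illustrative assumptions, not the exact values used in our codebase), the exported model could be loaded and queried with LibTorch as follows:

#include <torch/script.h>
#include <vector>

int main() {
    // Load the TorchScript-exported detector (the path is illustrative).
    torch::jit::script::Module detector = torch::jit::load("damo_yolo_fsoco.pt");

    // Switch off training-time behaviour (dropout, batch-norm updates).
    detector.eval();

    // Dummy color image tensor in NCHW layout; the real pipeline converts the
    // RealSense color frame into a tensor of the resolution the model expects.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::ones({1, 3, 640, 640}));

    // Run inference. Decoding the output into boxes, scores and classes
    // depends on the output format of the exported model and is omitted here.
    torch::jit::IValue detections = detector.forward(inputs);

    return 0;
}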

TODO: add a deployment timeline/graph

TODO: add a screenshot with the bounding box results

Camera Projection

The role of a camera is, in general terms, to project the 3D world onto a 2D image by capturing the light entering the lens.

The camera projective equation is as follows:

$$ \begin{equation} x = K[R|t]X \end{equation} $$

Where $x = (u,v,d)$ is the pixel vector: $(u,v)$ are the pixel coordinates and $d$ is the depth value at that pixel. $K$ is the intrinsic matrix, given by:

$$ \begin{equation} K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \end{equation} $$

Where $f_x$ and $f_y$ are the focal lengths and $c_x$ and $c_y$ are the coordinates of the principal point. The intrinsic parameters are obtained by camera calibration and describe where in the image plane points observed in the world are projected.

The $R$ matrix is the rotation matrix and $t$ is the translation vector, with shapes $(3\times3)$ and $(3\times1)$ respectively; together they form the extrinsic parameters $[R|t]$.
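As a worked sketch of the forward model (the intrinsic and extrinsic values below are purely illustrative), equation (1) can be evaluated with Eigen by writing $K[R|t]X$ as $K(RX + t)$ for a non-homogeneous point $X$:

#include <Eigen/Dense>

int main() {
    // Intrinsic matrix K, from illustrative focal lengths and principal point.
    Eigen::Matrix3d K;
    K << 600.0,   0.0, 320.0,
           0.0, 600.0, 240.0,
           0.0,   0.0,   1.0;

    // Extrinsics: rotation R and translation t from the car frame to the
    // camera frame (identity and a fixed offset, purely for illustration).
    Eigen::Matrix3d R = Eigen::Matrix3d::Identity();
    Eigen::Vector3d t(0.0, 0.0, 0.5);

    // A 3D point X expressed in the car frame.
    Eigen::Vector3d X(1.0, 0.5, 4.0);

    // Forward projection x = K(RX + t), equivalent to x = K[R|t]X with X homogeneous.
    Eigen::Vector3d x = K * (R * X + t);

    // The third component of x is the depth of the point in the camera frame;
    // dividing by it yields the pixel coordinates (u, v).
    double u = x.x() / x.z();
    double v = x.y() / x.z();
    (void)u; (void)v; // silence unused-variable warnings in this sketch

    return 0;
}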

In a monocular camera, depth is lost in the projection. In a stereo camera, however, depth can be estimated, which allows the projection equation above to be inverted to recover 3D points from pixel values.

By rearranging the equation to solve for the world points $X$, it can be written as follows:

$$ \begin{equation} X = R^{-1}(K^{-1}x - t) \end{equation} $$

This equation allows the local mapping module to recover the coordinates of the cones in the car frame from the depth values. In particular, only the center point of each bounding box is used, for the sake of simplicity.

The code below implements this task. Our implementation uses the Eigen linear algebra library, which achieves high performance by building expression trees at compile time and prevents a class of errors by checking dimensionality compatibility at compile time.

#include <local_mapping_ros/vision/ReconstructionFromDepth.h>

namespace t24e::local_mapper::vision {
    geometry_msgs::msg::Point ReconstructionFromDepth::deprojectPixelToPoint(RGBDCamera &cam, Eigen::Vector3d pixel) {

        // Invert the projection model: X = R^-1 (K^-1 x - t), mapping the
        // pixel vector (u, v, d) to a 3D point in the car frame.
        Eigen::Vector3d point = cam.getR().inverse() * (cam.getK().inverse() * pixel - cam.getT());

        // Convert the Eigen vector to a ROS 2 message.
        geometry_msgs::msg::Point point_msg;
        point_msg.x = point.x();
        point_msg.y = point.y();
        point_msg.z = point.z();

        return point_msg;
    }
} // namespace t24e::local_mapper::vision
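As a usage sketch (the bounding-box fields, the externally sampled depth, the assumption that RGBDCamera lives in the same namespace, and the assumption that deprojectPixelToPoint is callable as a static member are all illustrative, not the exact interfaces of our codebase), the midpoint of a detected cone's bounding box could be reprojected like this:

#include <local_mapping_ros/vision/ReconstructionFromDepth.h>
#include <Eigen/Dense>

// Hypothetical helper: reproject the midpoint of a detected bounding box,
// given its pixel corners and the depth sampled at that midpoint.
geometry_msgs::msg::Point coneToCarFrame(t24e::local_mapper::vision::RGBDCamera &cam,
                                         double u_min, double v_min,
                                         double u_max, double v_max,
                                         double depth_at_midpoint) {
    // Midpoint of the bounding box in pixel coordinates.
    double u = (u_min + u_max) / 2.0;
    double v = (v_min + v_max) / 2.0;

    // Pixel vector (u, v, d) as defined in the camera projection section.
    Eigen::Vector3d pixel(u, v, depth_at_midpoint);

    // Reproject to the car frame using the camera model.
    return t24e::local_mapper::vision::ReconstructionFromDepth::deprojectPixelToPoint(cam, pixel);
}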