Surface Reconstruction
In computer vision and computer graphics, 3D reconstruction is the process of capturing the shape and appearance of real objects. This process can be accomplished either by active or passive methods. If the model is allowed to change its shape in time, this is referred to as non-rigid or spatio-temporal reconstruction.
3D reconstruction has long been a challenging research goal. Using 3D reconstruction, one can determine an object's 3D profile, as well as the 3D coordinates of any point on that profile. The 3D reconstruction of objects is a general scientific problem and a core technology in a wide variety of fields, such as Computer Aided Geometric Design (CAGD), computer graphics, computer animation, computer vision, medical imaging, computational science, virtual reality, digital media, etc. For instance, a patient's lesion information can be presented in 3D on a computer, which offers a new and accurate approach to diagnosis and thus has vital clinical value.[4] Digital elevation models can be reconstructed using methods such as airborne laser altimetry or synthetic aperture radar.
Active methods, i.e. range-data methods, acquire a depth map and reconstruct the 3D profile by numerical approximation, building the object in the scene based on a model. These methods actively interfere with the reconstructed object, either mechanically or radiometrically, using rangefinders to acquire the depth map, e.g. structured light, laser rangefinders, and other active sensing techniques. A simple mechanical example would use a depth gauge to measure the distance to a rotating object placed on a turntable. More widely applicable radiometric methods emit radiation toward the object and measure the reflected part; examples range from moving light sources, colored visible light, and time-of-flight lasers to microwaves and 3D ultrasound.
NeRF is one such architecture that makes 3D reconstruction of an object possible, but it has its limitations:
Dynamic scenes – NeRF is designed to work with static scenes, meaning it cannot handle moving objects or scenes with dynamic lighting conditions. This limits its applicability in certain scenarios, such as augmented reality or virtual reality applications.
Scalability – NeRF requires a significant amount of computational resources, including GPU memory, to train and render high-quality 3D models. This makes it difficult to scale the technique to larger datasets or real-time applications.
Generalization to unseen scenes – NeRF may struggle to generalize to unseen scenes or objects, as it relies heavily on the training data to learn the underlying scene geometry and lighting conditions. This can be a problem when the goal is to generate realistic images of novel objects or scenes.
Let's take a closer look at some key definitions before we explore further methods for estimating depth:
Disparity – finding disparity is a correspondence problem: given a point (x, y) in the left image, we search along its epipolar line in the right image for the matching point (x - d, y). Disparity and distance from the cameras are inversely related.
Focal Length – the focal length of the cameras; it can be found through camera calibration
Baseline – the distance between the two camera centers
depth = (baseline * focal length) / disparity
Epipolar Geometry – relates corresponding points in two or more images.
Triangulation – the process of determining the 3D location of a point in space from its projections onto two or more images; it allows us to reconstruct the 3D geometry of a scene from its 2D projections. The basic principle is to intersect the viewing rays from the camera centers through the corresponding image points.
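To make the depth formula and triangulation above concrete, here is a minimal NumPy sketch using linear (DLT) triangulation. The calibration numbers in the example (focal length 700 px, baseline 0.1 m) are hypothetical, and in practice a library routine such as OpenCV's cv2.triangulatePoints plays the same role:

```python
import numpy as np

def depth_from_disparity(disparity, focal_length, baseline):
    # depth = (baseline * focal length) / disparity
    return (baseline * focal_length) / np.asarray(disparity, dtype=float)

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation: find the 3D point whose projections
    through the 3x4 camera matrices P1, P2 are the image points x1, x2."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of A with the smallest
    # singular value (the approximate null space of A).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean

# Hypothetical rectified stereo pair: f = 700 px, baseline = 0.1 m
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), [[-0.1], [0.0], [0.0]]])
X = triangulate_point(P1, P2, (390.0, 275.0), (355.0, 275.0))
d = depth_from_disparity(390.0 - 355.0, 700.0, 0.1)  # disparity = 35 px
```

Note that the triangulated z coordinate and the depth computed from the disparity formula agree, as expected for a rectified pair.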
Given the camera configuration we have in the Robotic Navigation Platform for spine surgery, Multi-View Stereo, or MVS, is the better technique for us to recreate a 3D object (we will ignore the Structure from Motion, or SfM, technique for now).
We will use this setup to build a dense 3D reconstruction from the images. In dense 3D reconstruction (unlike sparse reconstruction), 3D points are reconstructed for every pixel, or for a very large number of points, to create a dense representation of the scene or object being captured.
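The per-pixel idea can be sketched in a few lines of NumPy: back-project every valid pixel of a disparity map into a 3D point using depth = (baseline * focal length) / disparity. The calibration values in the toy example below are hypothetical:

```python
import numpy as np

def dense_point_cloud(disparity, f, cx, cy, baseline):
    """Back-project every pixel with a valid disparity into 3D,
    producing one point per pixel (a dense cloud)."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0                        # disparity <= 0: no match
    safe = np.where(valid, disparity, 1.0)       # avoid division by zero
    z = np.where(valid, (f * baseline) / safe, 0.0)
    x = (u - cx) * z / f                         # pinhole back-projection
    y = (v - cy) * z / f
    return np.stack([x, y, z], axis=-1)[valid]   # (N, 3) array of 3D points

# Toy 4x4 disparity map with a constant 35 px disparity everywhere
pts = dense_point_cloud(np.full((4, 4), 35.0),
                        f=700.0, cx=2.0, cy=2.0, baseline=0.1)
```

A constant disparity map yields a fronto-parallel plane: every pixel maps to the same depth, here 700 * 0.1 / 35 = 2.0.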
The output of any model depends on the major factors mentioned below:
Disparity Map Estimation.
Re-Projection Error.
Post Processing Technique.
Stereo rectification is the initial stage in MVS: when the cameras are at an angle to each other, rectification warps the images so that corresponding points fall on the same epipolar lines. The 3D reconstruction relies on the disparity output, so the next step is to estimate the disparity map with a deep learning model, and we want the best-performing one. Candidates include AANet, RAFT-Stereo, CREStereo, GMStereo, etc. We experimented with AANet, CREStereo, and GMStereo, of which GMStereo performed the best. After that, we move between frames using the rotation and translation matrices, reproject the points using triangulation for the two-camera setup, and finally apply post-processing. Post-processing is an important step in 3D reconstruction: it refines the output of the reconstruction algorithm to improve the accuracy and quality of the reconstructed 3D model through meshing, texturing, alignment, etc.
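Two of the steps above, reprojecting pixels to 3D and moving points between frames with a rotation and translation matrix, can be sketched as follows. The Q matrix here is a simplified form of the reprojection matrix that cv2.reprojectImageTo3D consumes for a rectified pair; all numbers are hypothetical:

```python
import numpy as np

def reproject_pixel(u, v, d, f, cx, cy, baseline):
    """Lift a pixel (u, v) with disparity d to a 3D point via a 4x4
    reprojection matrix Q for a rectified stereo pair (simplified
    version of the Q used by cv2.reprojectImageTo3D)."""
    Q = np.array([
        [1.0, 0.0, 0.0, -cx],
        [0.0, 1.0, 0.0, -cy],
        [0.0, 0.0, 0.0,   f],
        [0.0, 0.0, 1.0 / baseline, 0.0],
    ])
    X = Q @ np.array([u, v, d, 1.0])
    return X[:3] / X[3]                 # homogeneous -> Euclidean

def transform_points(points, R, t):
    """Move 3D points into another frame using its rotation R (3x3) and
    translation t (3,), e.g. to fuse clouds from successive frames."""
    return np.asarray(points) @ np.asarray(R).T + np.asarray(t)

p = reproject_pixel(390.0, 275.0, 35.0, f=700.0, cx=320.0, cy=240.0, baseline=0.1)
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])       # 90-degree rotation about z
q = transform_points([p], Rz, [0.0, 0.0, 1.0])[0]
```

Fusing clouds from multiple frames amounts to applying each frame's (R, t) and concatenating the transformed points before post-processing.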
Here are some examples of depth estimation and their projections onto a 3D point cloud/mesh:
Let's take a brief look at GMStereo and how its architecture differs from AANet's.
GMStereo
A unified dense correspondence matching formulation and model for three tasks: optical flow, stereo matching, and depth estimation.
This unified model naturally enables cross-task transfer (flow → stereo, flow → depth) since the model architecture and parameters are shared across tasks.
State-of-the-art or competitive performance on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
Strong features + parameter-free matching layers ⇒ a unified model for flow/stereo/depth.
An additional self-attention layer to propagate the high-quality predictions to unmatched regions.
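The "parameter-free matching layer" idea can be illustrated in a few lines: correlate a left-image feature against candidate right-image features, turn the scores into a distribution with a softmax, and take the expected disparity (a soft argmax). This is only an illustrative sketch of the mechanism, not the unimatch implementation, and all feature values are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

def match_disparity(feat_left, candidate_feats, candidate_disps):
    """Parameter-free matching: dot-product correlation + softmax gives a
    matching distribution; its expectation is a differentiable disparity."""
    scores = candidate_feats @ feat_left     # correlation with each candidate
    weights = softmax(scores)                # matching distribution
    return float(weights @ candidate_disps)  # soft-argmax disparity

# With discriminative features, the distribution is nearly one-hot and the
# predicted disparity collapses to the best-matching candidate.
f_left = np.array([10.0, 0.0, 0.0])
cands = np.array([[10.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0],
                  [0.0, 0.0, 10.0]])
disp = match_disparity(f_left, cands, np.array([1.0, 2.0, 3.0]))
```

This is why the matching layer needs no learned parameters: all the capacity lives in the feature extractor (the Transformer with cross-attention noted below), and the matching itself is just correlation plus softmax.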
AANet architecture
Note: GMStereo learns strong features with a Transformer (in particular cross-attention)
Cross-task transfer
Comparison of different methods, namely RAFT-Stereo, CREStereo and GMStereo
GMStereo produces sharper object structures than RAFT-Stereo and CREStereo.
Resources:
OpenCV: Depth Map from Stereo Images
GitHub - autonomousvision/unimatch: Unifying Flow, Stereo and Depth Estimation
Generate a 3D Mesh from an Image with Python | by Mattia Gatti | Towards Data Science