
InfiniTAM: An Open Source Semantic Fusion Library

For the English project page, please see: http://www.robots.ox.ac.uk/~victor/infinitam/

The Chinese version of this page is available at: https://mmcheng.net/zh/itam/

1.1 What is it?

InfiniTAM∞ is an open source, cross-platform, real-time framework for large-scale depth fusion and tracking. The framework is released under the Oxford University Isis academic licence.

1.2 Why use it?

InfiniTAM was developed with scalability, reusability and portability in mind:

  • On a machine with a single Nvidia GTX 780Ti GPU, the method runs at over 100 fps (the exact speed varies slightly with the scene being processed).
  • The real world can be represented either densely (as a full 3D voxel volume) or sparsely (as hash-indexed voxel blocks). The framework is designed so that other representations, such as octrees, can easily be swapped in.
  • InfiniTAM∞ exchanges memory between CPU and GPU on the fly, enabling reconstructions of theoretically unbounded size (not limited by GPU memory).
  • We provide C++ code for both the CPU and the GPU implementation, and most of this code is shared between the two.
  • The framework makes it very easy to integrate new modules, either replacing existing ones or extending the system for new uses, e.g. 3D object detection, new trackers, etc.
  • The code runs natively on Windows, Linux and Mac OS X.
  • The CPU version of the core processing library has no third-party dependencies (it can run standalone), and the GPU version depends only on CUDA. The user interface depends only on OpenGL and GLUT. Depth input can come from image files or from any camera supported by OpenNI 2.

1.3 How do I get it?

InfiniTAM can be downloaded from here.

1.4 How do I cite it?

If you use this code in your publications, please cite the arXiv version of the InfiniTAM technical report:

@article{2014arXiv1410.0925P,
	author = {{Prisacariu}, V.~A. and {K{\"a}hler}, O. and {Cheng}, M.~M. and 
		{Valentin}, J. and {Torr}, P.~H.~S. and {Reid}, I.~D. and {Murray}, D.~W.},
	title = "{A Framework for the Volumetric Integration of Depth Images}",
	journal = {ArXiv e-prints},
	archivePrefix = "arXiv",
	eprint = {1410.0925},
	year = 2014
}

1.5 Need help?

For questions in Chinese, please leave a comment at the bottom of this page; you will get a prompt reply. For questions in English, please open an issue here or contact the first two authors by email (their addresses can be found in the PDF version of the technical report).

Usage Instructions

As described in the technical overview, the InfiniTAM package consists of two parts. Most of the code lives in the ITMLib namespace, which contains a standalone 3D reconstruction library that can be used in other applications. The InfiniTAM namespace contains further auxiliary classes for image acquisition and the graphical user interface; these are likely to be replaced in user applications. Finally, the distributed code contains a simple sample application that lets users quickly try out the whole framework.

2.1 Compiling InfiniTAM

The project includes a Microsoft Visual Studio 2012 solution file and a CMake build file. The framework has been tested on Microsoft Windows 8.1, openSUSE Linux 12.3 and Apple Mac OS X 10.9. Apart from a basic C++ development environment, InfiniTAM depends on the following third-party libraries:

  • OpenGL / GLUT (e.g. freeglut 2.8.0): used for the visualisation in the InfiniTAM namespace and in the sample application; ITMLib itself does not depend on it. Freeglut is available from http://freeglut.sourceforge.net/.
  • NVIDIA CUDA SDK (e.g. version 6.5): used for all GPU-accelerated code. GPU acceleration is optional, and the CPU-only part of the framework can be compiled without CUDA. The CUDA SDK is available from https://developer.nvidia.com/cuda-downloads.
  • OpenNI (e.g. version 2.2.0.33): optional, used to obtain live depth and colour image streams from a depth sensor. The framework compiles and runs without OpenNI, reading image sequences from disk instead. OpenNI is available from http://structure.io/openni.

Finally, the framework includes Doxygen reference documentation that can be built separately. More information about the build process can be found in the README file included in the archive.

2.2 Running InfiniTAM

The figure above shows the interface of the InfiniTAM sample application. A raycast rendering of the reconstruction is displayed in real time, alongside the live depth and colour images from the camera. The per-frame processing time and the available keyboard shortcuts are shown at the bottom of the window. The shortcuts let the user process the next frame, process all remaining frames continuously, or exit. Other functions, such as exporting a 3D model of the current reconstruction or rendering it from arbitrary viewpoints, are not implemented yet (they will be added later). The user interface window and the interaction are implemented in UIEngine and depend on OpenGL and GLUT.

Running InfiniTAM requires the intrinsic calibration parameters of the depth camera (and optionally the colour camera). The calibration is specified through an external file; an example of such a file is shown below:

640 480
504.261 503.905
352.457 272.202

640 480
573.71 574.394
346.471 249.031

0.999749 0.00518867 0.0217975 0.0243073
-0.0051649 0.999986 -0.0011465 -0.000166518
-0.0218031 0.00103363 0.999762 0.0151706

1135.09 0.0819141

The file contains (i) for each camera (RGB and depth) the image size, the focal length and the principal point in pixels (the same parameters as used by the Matlab Calibration Toolbox); (ii) the Euclidean transformation mapping points from the RGB camera coordinate system to the depth camera coordinate system; and (iii) the parameters for converting disparity values to depth values. If the depth tracker is used, the calibration of the RGB camera and the RGB-to-depth transformation can be ignored. If live data from OpenNI is used, the disparity calibration can be ignored.
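
To illustrate the layout, the following minimal sketch reads a file in the format above into a simple structure. The struct and function names are invented for this example and do not correspond to InfiniTAM's own calibration classes, which organise the same information differently.

#include <fstream>

// Hypothetical structures mirroring the calibration file layout above; the
// names here are illustrative and differ from the classes used in ITMLib.
struct Intrinsics { int width, height; float fx, fy, cx, cy; };

struct Calibration
{
    Intrinsics rgb, depth;        // (i) image size, focal length, principal point per camera
    float rgbToDepth[3][4];       // (ii) Euclidean transform from RGB to depth coordinates
    float dispParams[2];          // (iii) disparity-to-depth conversion parameters
};

static void readIntrinsics(std::istream &f, Intrinsics &cam)
{
    f >> cam.width >> cam.height >> cam.fx >> cam.fy >> cam.cx >> cam.cy;
}

bool readCalibration(const char *filename, Calibration &out)
{
    std::ifstream f(filename);
    if (!f) return false;

    readIntrinsics(f, out.rgb);    // first block: RGB camera
    readIntrinsics(f, out.depth);  // second block: depth camera

    for (int r = 0; r < 3; ++r)    // third block: 3x4 RGB-to-depth transformation
        for (int c = 0; c < 4; ++c) f >> out.rgbToDepth[r][c];

    f >> out.dispParams[0] >> out.dispParams[1];   // last line: disparity calibration
    return (bool)f;
}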

We also provide two command-line examples that switch between the two possible data sources, OpenNI or image files. The data source is selected according to the number of arguments passed to InfiniTAM. An example Linux command line is:

 $ ./InfiniTAM Files/Teddy/calib.txt
 $ ./InfiniTAM Files/Teddy/calib.txt /Teddy/Frames/{video,depth}%04i.ppm


The first line starts InfiniTAM with the specified calibration file and live input from OpenNI, while the second uses the given calibration file and the RGB and depth images specified in the other two arguments. We tested the OpenNI input with a Microsoft Kinect for XBOX 360, with a PrimeSense Carmine 1.08 and with the Occipital Structure Sensor. For offline use with images from files, these have to be in PPM/PGM format, with the RGB images being standard PPM files and the depth images being 16-bit big-endian raw Kinect disparities.

2.3 Configuring InfiniTAM

All internal library settings are defined inside the ITMLibSettings class and they are:

/// Use GPU or run the code on the CPU instead.
bool useGPU;

/// Enables swapping between host and device.
bool useSwapping;

/// Tracker types
typedef enum {
    //! Identifies a tracker based on colour image
    TRACKER_COLOR,
    //! Identifies a tracker based on depth image
    TRACKER_ICP
} TrackerType;
/// Select the type of tracker to use
TrackerType trackerType;

/// Number of resolution levels for the tracker.
int noHierarchyLevels;

/// Number of resolution levels to track only rotation instead of full SE3.
int noRotationOnlyLevels;

/// For ITMColorTracker: skip every other point in energy function evaluation.
bool skipPoints;

/// For ITMDepthTracker: ICP distance threshold
float depthTrackerICPThreshold;

/// Further, scene specific parameters such as voxel size
ITMLib::Objects::ITMSceneParams sceneParams;

The ITMSceneParams further contain settings for the voxel size in millimetres and the truncation band of the signed distance function. Furthermore, the file ITMLibDefines.h contains definitions that select the type of voxels and the voxel index used in the compilation of ITMMainEngine:

/** This chooses the information stored at each voxel. At the moment, valid
    options are ITMVoxel_s, ITMVoxel_f, ITMVoxel_s_rgb and ITMVoxel_f_rgb 
*/
typedef ITMVoxel_s ITMVoxel;

/** This chooses the way the voxels are addressed and indexed. At the moment,
    valid options are ITMVoxelBlockHash and ITMPlainVoxelArray.
*/
typedef ITMLib::Objects::ITMVoxelBlockHash ITMVoxelIndex;

For using InfiniTAM as a 3D reconstruction library in other applications, the class ITMMainEngine is recommended as the main entry point to the ITMLib library. It performs the whole 3D reconstruction algorithm. Internally it stores the latest image as well as the 3D world model, and it keeps track of the camera pose. The intended use is as follows:

  • Create an ITMMainEngine specifying the internal settings, camera parameters and image sizes.
  • Get the pointer to the internally stored images with the method GetView() and write new image information to that memory.
  • Call the method ProcessFrame() to track the camera and integrate the new information into the world model.
  • Optionally access the rendered reconstruction or another image for visualisation using GetImage().

The internally stored information can be accessed through member variables trackingState and scene.
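
The sketch below outlines such a driver loop. The exact constructor arguments, the header path and the way frames are written into the view object are assumptions made for illustration only; consult the headers for the precise signatures.

// Simplified usage sketch for ITMMainEngine; constructor arguments, the header
// path and the getImages() helper are assumptions, not the exact library API.
#include "ITMLib/Engine/ITMMainEngine.h"

void reconstruct(ITMLibSettings *settings, ITMRGBDCalib *calib,
                 Vector2i rgbSize, Vector2i depthSize, ImageSourceEngine *source)
{
    // 1. create the main engine with internal settings, camera parameters and image sizes
    ITMMainEngine engine(settings, calib, rgbSize, depthSize);

    while (source->hasMoreImages())   // hypothetical image source loop
    {
        // 2. write the new RGB and depth frame into the internally stored images
        ITMView *view = engine.GetView();
        source->getImages(view);      // hypothetical helper filling the view's images

        // 3. track the camera pose and integrate the new frame into the world model
        engine.ProcessFrame();

        // 4. optionally fetch a rendering of the reconstruction for visualisation
        // engine.GetImage(...);

        // the camera pose and the map can be read via trackingState and scene
    }
}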

Technical Overview

Here we provide a technical overview of the InfiniTAM framework. A pdf version, which includes all relevant citations, can be downloaded from here.

3.1 Architecture Overview

We first present an overview of the overall processing pipeline of our framework and discuss the cross device implementation architecture. These serve as the backbone of the framework throughout this report.

Processing Pipeline

The main processing steps are illustrated in the figure above. As with the typical and well known KinectFusion pipeline, there is a tracking step for localising the camera, an integration step for updating the 3D world model and a raycasting step for rendering the 3D data in preparation for the next tracking step. To accommodate the additional processing requirements of octrees, voxel blocks or other data structures, an allocation step is included, in which the respective data structures are updated in preparation for integrating data from a new frame. An optional swapping step is also included for transferring data between GPU and CPU.

A detailed discussion of each individual stage follows below. For now, notice that our implementation follows the chain-of-responsibility design pattern: the relevant data structures (e.g. ITMImage and ITMScene) are passed between several processing engines (e.g. ITMSceneReconstructionEngine). The engines are stateless, and each one is responsible for one specific aspect of the overall processing pipeline. The state is passed on in objects containing the processed information. Finally, one central class (ITMMainEngine) holds instances of all objects and engines and controls the flow of information.
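
The toy classes below mirror this structure; they are purely illustrative and not part of the library.

// Toy illustration of the stateless-engine pattern: engines hold no state of
// their own, all state travels in the data objects passed between them.
struct ExampleView { /* most recent RGB and depth images */ };
struct ExampleTrackingState { /* current camera pose */ };
struct ExampleScene { /* voxel data and voxel index */ };

class ExampleReconstructionEngine
{
public:
    // stateless: results are written into the scene object that is passed in
    void IntegrateIntoScene(const ExampleView &view, const ExampleTrackingState &pose,
        ExampleScene &scene) { /* fuse the new frame into 'scene' */ }
};

class ExampleMainEngine   // owns all data objects and engines, controls the flow
{
    ExampleView view;
    ExampleTrackingState trackingState;
    ExampleScene scene;
    ExampleReconstructionEngine sceneReconstructor;
public:
    void ProcessFrame() { sceneReconstructor.IntegrateIntoScene(view, trackingState, scene); }
};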

Cross Device Implementation Architecture

Each engine is further split into three layers as shown in the figure above. The topmost Abstract Layer is accessed by the library’s main engine and is in general just a blank interface, although some common code can be shared at this point. The interface is implemented in the Device Specific Layer, which will be very different depending on whether it runs on a CPU, a GPU, with OpenMP or on other hardware acceleration architectures. In the Device Agnostic Layer there may be some inline code that is called from the higher layers and recompiled for the different architectures.

Considering the example of a tracking engine, ITMTracker, the Abstract Layer contains code for the generic optimisation of an error function, the Device Specific Layer contains a loop or GPU kernel call to evaluate the error function for all pixels in an image, and the Device Agnostic Layer contains a simple inline C-function to evaluate the error at a single pixel.
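
In code, this split might look roughly like the sketch below. The names are invented for illustration; the real classes are organised into separate per-device files.

// Toy illustration of the three-layer split; class and function names are illustrative.

// Device Agnostic Layer: inline code recompiled for each architecture.
_CPU_AND_GPU_CODE_ inline float errorForPixel(float predicted, float measured)
{
    float diff = predicted - measured;
    return diff * diff;
}

// Abstract Layer: the interface seen by the main engine, plus shared optimisation code.
class ExampleTracker
{
public:
    virtual ~ExampleTracker() {}
    virtual float evaluateError(const float *predicted, const float *measured, int noPixels) = 0;
};

// Device Specific Layer (CPU): a loop over all pixels; the CUDA version would
// instead launch a kernel that calls the same device agnostic helper per pixel.
class ExampleTracker_CPU : public ExampleTracker
{
public:
    float evaluateError(const float *predicted, const float *measured, int noPixels)
    {
        float sum = 0.0f;
        for (int i = 0; i < noPixels; i++) sum += errorForPixel(predicted[i], measured[i]);
        return sum;
    }
};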

3.2 Volumetric Representation

The volumetric representation is the central data structure within the InfiniTAM framework. It is used to store a truncated signed distance function that implicitly represents the 3D geometry by indicating for each point in space how far in front or behind the scene surface it is.

Being truncated, only a small band of values around the current surface estimate is actually being stored, while beyond the band the values are set to a maximum and minimum value, respectively.
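
Concretely, following the update code shown in Section 3.3.4, the value stored at a voxel is F = min(1, η/μ), where η = d_measured − d_voxel is the distance of the voxel in front of the measured surface and μ is the width of the truncation band; voxels with η < −μ, i.e. far behind the observed surface, are not updated at all.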

The data structure used for this representation crucially defines the memory efficiency of the 3D scene model and the computational efficiency of interacting with this model. We therefore keep it modular and interchangeable in our framework and provide two implementations, namely for a fixed size dense volume (ITMPlainVoxelArray) and voxel block hashing (ITMVoxelBlockHash). Both are described in the following, but first we briefly discuss how the representation is kept interchangeable throughout the framework.

3.2.1 Framework Flexibility

Core data classes dealing with the volumetric signed distance function are templated on (i) the type of voxel information stored, and (ii) the data structure used for indexing and accessing the voxels. The relevant engine classes are equally templated to provide specific implementations depending on the chosen data structures.

An example for a class representing the actual information at each voxel is given in the following listing:

struct ITMVoxel_s_rgb
{
    _CPU_AND_GPU_CODE_ static short SDF_initialValue() { return 32767; }
    _CPU_AND_GPU_CODE_ static float SDF_valueToFloat(float x)
    { return (float)(x) / 32767.0f; }
    _CPU_AND_GPU_CODE_ static short SDF_floatToValue(float x)
    { return (short)((x) * 32767.0f); }

    static const bool hasColorInformation = true;

    /** Value of the truncated signed distance transformation. */
    short sdf;
    /** Number of fused observations that make up @p sdf. */
    uchar w_depth;
    /** RGB colour information stored for this voxel. */
    Vector3u clr;
    /** Number of observations that made up @p clr. */
    uchar w_color;

    _CPU_AND_GPU_CODE_ ITMVoxel_s_rgb()
    {
        sdf = SDF_initialValue();
        w_depth = 0;
        clr = (uchar)0;
        w_color = 0;
    }
};

where the macro _CPU_AND_GPU_CODE_ identifies methods and functions that can be run both as host and as device code and is defined as:

#if defined(__CUDACC__) && defined(__CUDA_ARCH__)
#define _CPU_AND_GPU_CODE_ __device__   // for CUDA device code
#else
#define _CPU_AND_GPU_CODE_ 
#endif

Alternative voxel types provided along with the current implementation are based on floating point values instead of short, or they do not contain colour information. Note that the member hasColorInformation is used in the provided integration methods to decide whether or not to gather colour information, which has an impact on the processing speed accordingly.

Two examples for the index data structures that allow accessing the voxels are given below. Again, note that the processing engines choose different implementations according to the selected indexing class. For example the allocation step for voxel block hashing has to modify a hash table, whereas for a dense voxel array it can return immediately and does not have to do anything.

3.2.2 Dense Volumes

We first discuss the naive way of using a dense volume of limited size, as presented in the earlier KinectFusion works. This is very well suited for understanding the basic workings of the algorithms, it is trivial to parallelise on the GPU, and it is sufficient for small scale reconstruction tasks.

In the InfiniTAM framework the class ITMPlainVoxelArray is used to represent dense volumes and a simplified listing of this class is given below:

class ITMPlainVoxelArray
{
    public:
    struct ITMVoxelArrayInfo {
        /// Size in voxels
        Vector3i size;
        /// offset of the lower left front corner of the volume in voxels
        Vector3i offset;

        ITMVoxelArrayInfo(void)
        {
            size.x = size.y = size.z = 512;
            offset.x = -256;
            offset.y = -256;
            offset.z = 0;
        }
    };

    typedef ITMVoxelArrayInfo IndexData;

    private:
    IndexData indexData_host;

    public:
    ITMPlainVoxelArray(bool allocateGPU)
    ...

    ~ITMPlainVoxelArray(void)
    ...

    /** Maximum number of total entries. */
    int getNumVoxelBlocks(void) { return 1; }
    int getVoxelBlockSize(void) { return indexData_host.size.x * indexData_host.size.y * indexData_host.size.z; }

    const Vector3i getVolumeSize(void) { return indexData_host.size; }

    const IndexData* getIndexData(void) const
    ...
};

Note that the subtype IndexData as well as the methods getIndexData(), getNumVoxelBlocks() and getVoxelBlockSize() are used extensively within the processing engines, and on top of that the methods getNumVoxelBlocks() and getVoxelBlockSize() are used to determine the overall number of voxels that have to be allocated.

Depending on the choice of voxel type, each voxel requires at least 3 bytes and, depending on the available memory, a total volume of about 768×768×768 voxels is typically the upper limit for this dense representation. To make the most of this limited size, the initial camera is typically placed towards the centre of the rear plane of this voxel cube, as identified by the offset in our implementation, so that there is maximum overlap between the view frustum and the volume.
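
For illustration, looking up a voxel in this dense representation amounts to the offset-and-linearise computation below; the helper is written for this text and is not taken verbatim from the library (in particular, the x-fastest memory ordering is an assumption).

// Illustrative lookup into the dense voxel array; ITMLib's accessor code is
// organised differently but performs the same arithmetic. The x-fastest
// ordering of the linear index is an assumption made for this sketch.
_CPU_AND_GPU_CODE_ inline int denseVoxelIndex(const Vector3i &point,
    const ITMPlainVoxelArray::IndexData *indexData, bool &isFound)
{
    // shift from world voxel coordinates into array coordinates
    int x = point.x - indexData->offset.x;
    int y = point.y - indexData->offset.y;
    int z = point.z - indexData->offset.z;

    isFound = (x >= 0) && (x < indexData->size.x) &&
              (y >= 0) && (y < indexData->size.y) &&
              (z >= 0) && (z < indexData->size.z);
    if (!isFound) return -1;

    // linearise the three coordinates into a single array index
    return x + y * indexData->size.x + z * indexData->size.x * indexData->size.y;
}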

3.2.3 Voxel Block Hashing

The key idea for scaling the SDF-based representation to larger 3D environments is to drop the empty voxels from outside the truncation band and to represent only the relevant parts of the volume. In Niessner et al. this is achieved using a hash lookup of subblocks of the volume. Our framework provides an implementation of this method that we explain in the following.

Voxels are grouped in blocks of a predefined size (currently 8×8×8 voxels). All the voxel blocks are stored in a contiguous array, referred to henceforth as the voxel block array or VBA. In the current implementation this has a fixed size of 2¹⁸ elements.

To handle the voxel blocks we further need:

  • Hash Table and Hashing Function: Enable fast access to voxel blocks in the voxel block array.
  • Hash Table Operations: Insertion, retrieval and deletion of voxel blocks.

Hash Table and Hashing Function

To quickly and efficiently find the position of a certain voxel block in the voxel block array, we use a hash table. This hash table is a contiguous array of ITMHashEntry objects of the following form:

struct ITMHashEntry
{
    /** Position of the corner of the 8x8x8 volume, that identifies the entry. */
    Vector3s pos;
    /** Offset in the excess list. */
    int offset;
    /** Pointer to the voxel block array.
        - >= 0 identifies an actual allocated entry in the voxel block array
        - -1 identifies an entry that has been removed (swapped out)
        - <-1 identifies an unallocated block
    */
    int ptr;
};

The hash function hashIndex for locating entries in the hash table takes the corner coordinates vVoxPos of a 3D voxel block and computes an index as follows (Niessner et al.):

template<typename T> _CPU_AND_GPU_CODE_ inline int hashIndex(const InfiniTAM::Vector3<T> vVoxPos, const int hashMask){
    return ((uint)(((uint)vVoxPos.x * 73856093) ^ ((uint)vVoxPos.y * 19349669) ^ ((uint)vVoxPos.z * 83492791)) & (uint)hashMask);
}

To deal with hash collisions, each hash index points to a bucket of fixed size (typically 2), which we consider the ordered part of the hash table. There is an additional unordered excess list that is used once an ordered bucket fills up. In either case, each ITMHashEntry in the hash table stores an offset in the voxel block array and can hence be used to localise the voxel data for this specific voxel block. This overall structure is illustrated in the figure below.

Hash Table Operations

The three main operations used when working with a hash table are the insertion, retrieval and removal of entries. In the current version of InfiniTAM we support the former two, with removal not currently required or implemented. The code used by the retrieval operation is shown below:

template<class TVoxel>
_CPU_AND_GPU_CODE_ inline TVoxel readVoxel(const TVoxel *voxelData, const ITMVoxelBlockHash::IndexData *voxelIndex, const Vector3i & point, bool &isFound)
{
    const ITMHashEntry *hashTable = voxelIndex->entries_all;
    const TVoxel *localVBA = voxelData;
    TVoxel result; Vector3i blockPos; int offsetExcess;

    int linearIdx = vVoxPosParse(point, blockPos);
    int hashIdx = hashIndex(blockPos, SDF_HASH_MASK) * SDF_ENTRY_NUM_PER_BUCKET;

    isFound = false;

    //check ordered list
    for (int inBucketIdx = 0; inBucketIdx < SDF_ENTRY_NUM_PER_BUCKET; inBucketIdx++) 
    {
        const ITMHashEntry &hashEntry = hashTable[hashIdx + inBucketIdx];
        offsetExcess = hashEntry.offset - 1;

        if (hashEntry.pos == blockPos && hashEntry.ptr >= 0)
        {
            result = localVBA[(hashEntry.ptr * SDF_BLOCK_SIZE3) + linearIdx];
            isFound = true;
            return result;
        }
    }

    //check excess list
    while (offsetExcess >= 0)
    {
        const ITMHashEntry &hashEntry = hashTable[SDF_BUCKET_NUM * SDF_ENTRY_NUM_PER_BUCKET + offsetExcess];

        if (hashEntry.pos == blockPos && hashEntry.ptr >= 0)
        {
            result = localVBA[(hashEntry.ptr * SDF_BLOCK_SIZE3) + linearIdx];
            isFound = true;
            return result;
        }

        offsetExcess = hashEntry.offset - 1;
    }

    return result;
}

Both insertion and retrieval work by iterating through the elements of the list stored within the hash table. Given a target 3D voxel location in world coordinates, we first compute its corresponding voxel block location, by dividing the voxel location by the size of the voxel blocks. Next, we call the hashing function hashIndex to compute the index of the bucket from the ordered part of the hash table. All elements in the bucket are then checked, with retrieval looking for the target block location and insertion for an unallocated hash entry. If this is found, retrieval returns the voxel stored at the target location within the block addressed by the hash entry. Insertion (i) reserves a block inside the voxel block array and (ii) populates the hash table with a new entry containing the reserved voxel block array address and target block 3D world coordinate location.

If all locations in the bucket are exhausted, the enumeration of the list moves to the linked list in the unordered part of the hash table, using the offset field to provide the location of the next hash entry. The enumeration finishes when offset is found to be smaller than or equal to 0. At this point, if the target location still has not been found, retrieval returns an empty voxel. Insertion (i) reserves an unallocated entry in the unordered part of the hash table and a block inside the voxel block array, (ii) populates the hash table with a new entry containing the reserved voxel block array address and the target block's 3D world coordinate location and (iii) changes the offset field in the previous entry in the linked list to point to the newly populated one.

The reserve operations used for the unordered part of the hash table and for the voxel block array use prepopulated allocation lists and, in the GPU code, atomic operations.
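
Since only the retrieval code is listed above, the following heavily simplified CPU sketch outlines the corresponding insertion logic for a single voxel block; it omits the visibility book-keeping, does not walk an existing excess-list chain and, unlike the GPU code, performs the reserve step without atomics.

// Heavily simplified sketch of inserting one voxel block into the hash table.
// The real allocation code also walks an existing excess-list chain, maintains
// visibility flags and performs the reserve steps atomically on the GPU.
inline void allocateVoxelBlock(ITMHashEntry *hashTable, const Vector3s &blockPos,
    const int *voxelAllocationList, int &lastFreeVoxelBlockId)
{
    int hashIdx = hashIndex(blockPos, SDF_HASH_MASK) * SDF_ENTRY_NUM_PER_BUCKET;

    for (int inBucketIdx = 0; inBucketIdx < SDF_ENTRY_NUM_PER_BUCKET; inBucketIdx++)
    {
        ITMHashEntry &hashEntry = hashTable[hashIdx + inBucketIdx];

        // block already allocated: nothing to do
        if (hashEntry.pos == blockPos && hashEntry.ptr >= 0) return;

        if (hashEntry.ptr < -1)   // unallocated slot in the ordered part of the bucket
        {
            hashEntry.pos = blockPos;
            hashEntry.offset = 0;
            // reserve a block from the stack of free voxel block indices
            hashEntry.ptr = voxelAllocationList[lastFreeVoxelBlockId--];
            return;
        }
    }

    // bucket exhausted: the full implementation now reserves an entry in the
    // unordered excess list, fills it in the same way, and links it into the
    // chain by updating the offset field of the previous entry.
}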

All hash table operations are done through these functions and there is no direct memory access encouraged or indeed permitted by the current version of the code.

3.3 Individual Method Stages

A general outline of the InfiniTAM processing pipeline has already been given above. Details of each processing stage will be discussed in the following. The distinct stages are implemented using individual engines and a simplified diagram of the corresponding classes and their collaborations is given in the figure below.

The stages are:

  • Tracking: The camera pose for the new frame is obtained by fitting the current depth (or optionally colour) image to the projection of the world model from the previous frame. This is implemented using the ITMTracker, ITMColorTracker and ITMDepthTracker engines.
  • Allocation: Based on the depth image, new voxel blocks are allocated as required and a list of all visible voxel blocks is built. This is implemented inside the ITMSceneReconstructionEngine.
  • Integration: The current depth and colour frames are integrated within the map. This is implemented inside the ITMSceneReconstructionEngine.
  • Swapping In and Out: If required, map data is swapped in from host memory to device memory and merged with the present data. Parts of the map that are not required are swapped out from device memory to host memory. This is implemented using the ITMSwappingEngine and ITMGlobalCache.
  • Raycasting: The world model is rendered from the current pose (i) for visualisation purposes and (ii) to be used by the tracking stage at the next frame. This uses ITMSceneReconstructionEngine.

As illustrated above, the main processing engines are contained within the ITMLib namespace. Apart from these, the UI and image acquisition engines (UIEngine, ImageSourceEngine and OpenNIEngine) are contained in the InfiniTAM namespace.

The ITMLib namespace also contains two additional engine classes, ITMLowLevelEngine and ITMVisualisationEngine, that we do not discuss in detail. These are used, respectively, for low level image processing (e.g. gradients and the computation of a resolution hierarchy) and for visualisation (currently only a conversion of depth images to uchar4).

In this section we discuss the tracking, allocation, integration and raycasting stages in greater detail. We delay the discussion of swapping until later on this page.

3.3.1 Tracking

In the tracking stage the camera pose, consisting of the rotation matrix R and translation vector t, has to be determined given the RGB-D image and the current 3D world model. Along with InfiniTAM we provide two engines: ITMDepthTracker, which performs tracking based on the new depth image, and ITMColorTracker, which uses the colour image. Both implement the abstract ITMTracker class and have implementations running on the CPU and on CUDA.

In the ITMDepthTracker we follow the original alignment process as described in the KinectFusion papers:

  • Render a map V of surface points and a map N of surface normals from the viewpoint of an initial guess for R and t.
  • Project all points p from the depth image onto points p̄ in V and N and compute their distances from the planar approximation of the surface, i.e. d = (R p + t − V(p̄))ᵀ N(p̄).
  • Find R and t minimising the linearised sum of the squared distances by solving a linear equation system.
  • Iterate the previous two steps until convergence.

A resolution hierarchy of the depth image is used in our implementation to improve the convergence behaviour.
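
The per-pixel quantity driving this optimisation is the point-to-plane distance given above; a device agnostic helper for it could look like the sketch below, where the function name and argument layout are illustrative and the homogeneous 4-vector arithmetic follows the style of the integration code listed in Section 3.3.4.

// Point-to-plane distance for a single depth pixel; names and signature are
// illustrative, only the formula matches the description above.
_CPU_AND_GPU_CODE_ inline float pointToPlaneDistance(const Matrix4f &M,
    const Vector4f &p,        // 3D point from the new depth image (homogeneous)
    const Vector4f &V_pbar,   // raycast surface point at the projection of p
    const Vector4f &N_pbar)   // raycast surface normal at that pixel
{
    Vector4f p_world = M * p;   // R p + t, applied via the homogeneous pose matrix

    // project the offset from the corresponding surface point onto the normal
    return (p_world.x - V_pbar.x) * N_pbar.x +
           (p_world.y - V_pbar.y) * N_pbar.y +
           (p_world.z - V_pbar.z) * N_pbar.z;
}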

Alternatively the colour image can be used within an ITMColorTracker. In this case the alignment process is as follows:

  • Create a list V of surface points and a corresponding list C of colours from the viewpoint of an initial guess.
  • Project all points from V into the current colour image I and compute the Euclidean norm of the difference in colours, i.e. d = ‖I(π(R V(i) + t)) − C(i)‖₂.
  • Find R and t minimising the sum of the squared differences using the Levenberg-Marquardt optimisation algorithm.

Again a resolution hierarchy in the colour image is used, and the list of surface points is subsampled by a factor of 4. A flag in ITMLibSettings allows the user to select which tracker is used; the default is the ITMDepthTracker.

The two main classes ITMDepthTracker and ITMColorTracker actually only implement a shared optimisation framework, including e.g. the Levenberg-Marquardt algorithm and the solution of the linear equation systems; this part always runs on the CPU. The evaluation of the error function value, gradient and Hessian is implemented in derived, CPU- and CUDA-specific classes and makes use of parallelisation.

3.3.2 Allocation

In the allocation stage the underlying data structure for the representation of the truncated signed distance function is prepared and updated, so that a new depth image can be integrated. In the simple case of a dense voxel grid, the allocation stage does nothing. In contrast, for voxel block hashing the goal is to find all the voxel blocks affected by the current depth image and to make sure that they are allocated in the hash table (Niessner et al.).

In our implementation of voxel block hashing, the aim was to minimise the use of blocking operations (e.g. atomics) and to completely avoid the use of critical sections. This has led us to doing the processing in three separate stages, as we explain in the following.

In the first stage we project each pixel from the depth image to 3D space and create a line segment along the ray from depth d−μ to d+μ, where d is the measured depth at the pixel and μ is the width of the truncation band of the signed distance function. This line segment intersects a number of voxel blocks, and we search for these voxel blocks in the hash table. If one of the blocks is not allocated yet, we find a free hash bucket space for it. As a result for the next stage we create two arrays, each of the same size as the number of elements in the hash table. The first array contains a bool indicating the visibility of the voxel block referenced by the hash table entry, the second contains information about new allocations that have to be performed. Note that this requires only simple, non-atomic writes and if more than one new block has to be allocated with the same hash index, only the most recently written allocation will actually be performed. We tolerate such artefacts from intra-frame hash collisions, as they will be corrected in the next frame automatically for small intra-frame camera motions.

In the second stage we allocate voxel blocks for each non-zero entry in the allocation array that we built previously. This is done using a single atomic subtraction on a stack of free voxel block indices i.e. we decrease the number of remaining blocks by one and add the previous head of the stack to the hash entry.
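
In the CUDA implementation this second stage amounts to one atomic update per newly requested block, along the lines of the sketch below; the kernel and variable names are invented for illustration and the exhaustion handling is simplified.

// Illustrative CUDA kernel for the second allocation stage: one thread per
// hash entry; entries flagged in stage one pop a block index off the stack of
// free voxel blocks using a single atomic subtraction.
__global__ void allocateFlaggedBlocks_device(ITMHashEntry *hashTable,
    const unsigned char *allocationFlags, const Vector3s *blockCoords,
    const int *voxelAllocationList, int *lastFreeVoxelBlockId, int noTotalEntries)
{
    int targetIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (targetIdx >= noTotalEntries) return;
    if (allocationFlags[targetIdx] == 0) return;    // no allocation requested here

    // atomically take the current head of the free-block stack
    int stackTop = atomicSub(lastFreeVoxelBlockId, 1);
    if (stackTop < 0) return;                       // out of free blocks (simplified handling)

    ITMHashEntry hashEntry;
    hashEntry.pos = blockCoords[targetIdx];         // block coordinates recorded in stage one
    hashEntry.offset = 0;
    hashEntry.ptr = voxelAllocationList[stackTop];  // address of the reserved voxel block

    hashTable[targetIdx] = hashEntry;
}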

In the third stage we build a list of live voxel blocks, i.e. a list of the blocks that project inside the visible view frustum. This is later going to be used by the integration and swapping stages.

3.3.4 Integration

In the integration stage, the information from the most recent RGB-D image is incorporated into the 3D world model. In case of a dense voxel grid this is identical to the integration in the original KinectFusion algorithm and for voxel block hashing the changes are minimal after the list of visible voxel blocks has been created in the allocation step.

For each voxel in any of the visible voxel blocks, or for each voxel in the whole volume for dense grids, the function computeUpdatedVoxelDepthInfo is called. If a voxel is behind the surface observed in the new depth image by more than the truncation band of the signed distance function, the image does not contain any new information about this voxel, and the function returns without writing any changes. If the voxel is close to or in front of the observed surface, a corresponding observation is added to the accumulated sum. This is illustrated in the listing of the function computeUpdatedVoxelDepthInfo below, and there is a similar function in the code that additionally updates the colour information of the voxels.

template<class TVoxel>
_CPU_AND_GPU_CODE_ inline float computeUpdatedVoxelDepthInfo(TVoxel &voxel, Vector4f pt_model, Matrix4f M_d, Vector4f projParams_d,
    float mu, int maxW, float *depth, Vector2i imgSize)
{
    Vector4f pt_camera; Vector2f pt_image;
    float depth_measure, eta, oldF, newF;
    int oldW, newW;

    // project point into image
    pt_camera = M_d * pt_model;
    if (pt_camera.z <= 0) return -1;

    pt_image.x = projParams_d.x * pt_camera.x / pt_camera.z + projParams_d.z;
    pt_image.y = projParams_d.y * pt_camera.y / pt_camera.z + projParams_d.w;
    if ((pt_image.x < 1) || (pt_image.x > imgSize.x - 2) || (pt_image.y < 1) || (pt_image.y > imgSize.y - 2)) return - 1;

    // get measured depth from image
    depth_measure = depth[(int)(pt_image.x + 0.5f) + (int)(pt_image.y + 0.5f) * imgSize.x];
    if (depth_measure <= 0.0) return -1;

    // check whether voxel needs updating
    eta = depth_measure - pt_camera.z;
    if (eta < -mu) return eta;

    // compute updated SDF value and reliability
    oldF = TVoxel::SDF_valueToFloat(voxel.sdf); oldW = voxel.w_depth;
    newF = MIN(1.0f, eta / mu);
    newW = 1;

    newF = oldW * oldF + newW * newF;
    newW = oldW + newW;
    newF /= newW;
    newW = MIN(newW, maxW);

    // write back
    voxel.sdf = TVoxel::SDF_floatToValue(newF);
    voxel.w_depth = newW;

    return eta;
}

The main difference between the dense voxel grid and the voxel block hashing representations is that the aforementioned update function is called for a different number of voxels and from within different loop constructs.
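
For voxel block hashing, the surrounding loop visits every voxel of every visible block, roughly as in the simplified CPU sketch below (the CUDA version instead launches one thread block per visible voxel block); variable names and the exact parameters are condensed for illustration.

// Simplified CPU loop construct for the integration stage with voxel block
// hashing; names are illustrative, the inner call is the function listed above.
template<class TVoxel>
void integrateVisibleBlocks(TVoxel *localVBA, const ITMHashEntry *hashTable,
    const int *visibleBlockIds, int noVisibleBlocks, float voxelSize,
    const Matrix4f &M_d, const Vector4f &projParams_d, float mu, int maxW,
    float *depth, Vector2i imgSize)
{
    for (int listIdx = 0; listIdx < noVisibleBlocks; listIdx++)
    {
        const ITMHashEntry &hashEntry = hashTable[visibleBlockIds[listIdx]];
        if (hashEntry.ptr < 0) continue;   // block not resident in the voxel block array

        TVoxel *localBlock = &localVBA[hashEntry.ptr * SDF_BLOCK_SIZE3];

        for (int z = 0; z < SDF_BLOCK_SIZE; z++) for (int y = 0; y < SDF_BLOCK_SIZE; y++)
            for (int x = 0; x < SDF_BLOCK_SIZE; x++)
            {
                int locId = x + y * SDF_BLOCK_SIZE + z * SDF_BLOCK_SIZE * SDF_BLOCK_SIZE;

                // voxel centre in world (model) coordinates, scaled by the voxel size
                Vector4f pt_model;
                pt_model.x = (float)(hashEntry.pos.x * SDF_BLOCK_SIZE + x) * voxelSize;
                pt_model.y = (float)(hashEntry.pos.y * SDF_BLOCK_SIZE + y) * voxelSize;
                pt_model.z = (float)(hashEntry.pos.z * SDF_BLOCK_SIZE + z) * voxelSize;
                pt_model.w = 1.0f;

                computeUpdatedVoxelDepthInfo(localBlock[locId], pt_model, M_d,
                    projParams_d, mu, maxW, depth, imgSize);
            }
    }
}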

3.3.5 Raycast

As the last step in the pipeline, an image is computed from the updated 3D world model to provide input for the tracking at the next frame. This image can also be used for visualisation. The main process underlying this rendering is raycasting, i.e. for each pixel in the image a ray is being cast from the camera up until an intersection with the surface is found. This essentially means checking the value of the truncated signed distance function at each voxel along the ray until a zero-crossing is found, and the same raycasting engine can be used for a range of different representations, as long as an appropriate readVoxel() function is called for reading values from the SDF.

As noted in the original KinectFusion paper, the performance of the raycasting can be improved significantly by taking larger steps along the ray. The value of the truncated signed distance function can serve as a conservative estimate for the distance to the nearest surface, hence this value can be used as step size. To additionally handle empty space in the volumetric representation, where no corresponding voxel block has been allocated, we introduce a state machine with the following states:

enum {
    SEARCH_BLOCK_COARSE,
    SEARCH_BLOCK_FINE,
    SEARCH_SURFACE,
    BEHIND_SURFACE,
    WRONG_SIDE
} state;

Starting from SEARCH_BLOCK_COARSE, we take steps of the size of each block, i.e. 8 voxels, until an actually allocated block is encountered. Once the ray enters an allocated block, we take a step back and enter state SEARCH_BLOCK_FINE, indicating that the step length is now limited by the truncation band μ of the signed distance function. Once we enter a valid block and the values in that block indicate we are still in front of the surface, the state is changed to SEARCH_SURFACE until a negative value is read from the signed distance function, which indicates we are now in state BEHIND_SURFACE. This terminates the raycasting iteration and the exact location of the surface is then found using two trilinear interpolation steps. The state WRONG_SIDE is entered if we are searching for a valid block in state SEARCH_BLOCK_FINE and encounter negative SDF values, indicating we are behind the surface as soon as we enter a block. In this case the ray hits the surface from behind, and we do not want to count the boundary between the unallocated, empty space and the block with the negative values as an object surface.
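
Leaving out the block-search states, the core of the per-pixel march can be sketched as follows; the helper name, the coordinate handling and the minimum step length are simplifications for this text, and the real code additionally implements the state machine above, a per-pixel search range and the trilinear refinement.

// Simplified per-pixel raycast march in voxel coordinates: step along the ray
// using the truncated SDF value as a conservative step length until a zero
// crossing is found. Names and the unit handling are illustrative.
template<class TVoxel>
_CPU_AND_GPU_CODE_ inline bool castRaySimple(Vector3f &intersection,
    const TVoxel *voxelData, const ITMVoxelBlockHash::IndexData *voxelIndex,
    const Vector3f &rayStart, const Vector3f &rayDir, float mu, float maxLength)
{
    float totalLength = 0.0f;
    bool isFound = false;

    while (totalLength < maxLength)
    {
        Vector3f pt;   // current sample point along the ray
        pt.x = rayStart.x + rayDir.x * totalLength;
        pt.y = rayStart.y + rayDir.y * totalLength;
        pt.z = rayStart.z + rayDir.z * totalLength;

        Vector3i ptInt;
        ptInt.x = (int)pt.x; ptInt.y = (int)pt.y; ptInt.z = (int)pt.z;

        float sdf = TVoxel::SDF_valueToFloat(
            readVoxel(voxelData, voxelIndex, ptInt, isFound).sdf);

        if (isFound && sdf <= 0.0f)
        {
            intersection = pt;   // zero crossing: refine with trilinear interpolation (omitted)
            return true;
        }

        // the truncated SDF is a conservative bound on the distance to the
        // surface, so it can serve as the step length; step by the truncation
        // band through space where no block is allocated
        float step = isFound ? sdf * mu : mu;
        if (step < 1.0f) step = 1.0f;   // always advance by at least one voxel
        totalLength += step;
    }

    return false;
}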

Another measure for improving the performance of the raycasting is to select a plausible search range. If a sparse voxel block representation is used, we are given a list of visible blocks from the allocation step, and we can render these blocks by forward projection to get an idea of the maximum and minimum depth values to expect at each pixel. Within InfiniTAM this can be done using an ITMBlockProjectionEngine. A naive implementation on the CPU computes the 2D bounding box of the projection of each voxel block into the image and fills this area with the maximum and minimum depth values of the corresponding 3D bounding box of the voxel block, correctly handling overlapping bounding boxes, of course.

To parallelise this process on the GPU we split it into two steps. First we project each block down into the image, compute the bounding box, and create a list of 16×16 pixel fragments that are to be filled with specific minimum and maximum depth values. Apart from a prefix sum to count the number of fragments, this is trivially parallelisable. Second we go through the list of fragments and actually render them. Updating the minimum and maximum depth for a pixel requires atomic operations, but by splitting the process into fragments we reduce the number of collisions to typically a few hundred or a few thousand pixels in a 640×480 image and achieve an efficiently parallelised overall process.

3.3.6 Swapping

Voxel hashing already enables much larger maps to be created than the much simpler dense 3D volumes. Video card memory capacity, however, is often quite limited. Practically, an off-the-shelf video card can roughly hold the map of a single room at 4mm voxel resolution in active memory, even with voxel hashing. This problem can be mitigated using a traditional method from the graphics community, namely the swapping of data between host and device memories. We only hold the active part of the map in video card memory, i.e. only the parts that are inside or close to the current view frustum. The remainder of the map is swapped out to host memory and swapped back in as needed.

We have designed our swapping framework aiming for the following three objectives: (O1) the transfers between host and device should be minimised and have guaranteed maximum bounds, (O2) host processing time should be kept to a minimum and (O3) no assumptions should be made about the type and speed of the host memory, i.e. it could be a hard drive. These objectives lead to the following design considerations:

  • O1: All memory transfers use a host/device buffer of fixed user-defined size.
  • O2: The host map memory is configured as a voxel block array of size equal to the number of entries in the hash table. Therefore, to check if a hash entry has a corresponding voxel block in the host memory, only the hash table index needs to be transferred and checked. The host does not need to perform any further computations, e.g. as it would have to do if a separate host hash table were used. Furthermore, whenever a voxel block is deallocated from device memory, its corresponding hash entry is not deleted but rather marked as unavailable in device memory, and, implicitly, available in host memory. This (i) helps maintain consistency between device hash table and host voxel block storage and (ii) enables a fast visibility check for the parts of the map stored only in host memory.
  • O3: Global memory speed is a function of the type of storage device used, e.g. faster for RAM and slower for flash or hard drive storage. This means that, for certain configurations, host memory operations can be considerably slower than the device reconstruction. To account for this behaviour and to enable stable tracking, the device is constantly integrating new live depth data even for parts of the scene that are known to have host data that is not yet in device memory. This might mean that, by the time all visible parts of the scene have been swapped into the device memory, some voxel blocks might hold large amounts of new data integrated by the device. We could replace the newly fused data with the old one from the host stored map, but this would mean disregarding perfectly fine map data. Instead, after the swapping operation, we run a secondary integration that fuses the host voxel block data with the newly fused device map.

These design considerations have led us to the swapping in/out pipeline shown in the figure below. We use the allocation stage to establish which parts of the map need to be swapped in, and the integration stage to mark which parts need to be swapped out. A voxel needs to be swapped (i) in from the host once it projects within a small (tunable) distance from the boundaries of the live visible frame and (ii) out to disk after data has been integrated from the depth camera.
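
The per-block decision can be summarised roughly as in the sketch below; the flags are condensed into boolean arguments for illustration, whereas the actual book-keeping lives in ITMGlobalCache and the ITMSwappingEngine.

// Illustrative per-block swapping decision; in the library this information is
// spread over the hash table, the visibility lists and ITMGlobalCache state.
enum SwapAction { SWAP_NONE, SWAP_IN, SWAP_OUT };

inline SwapAction decideSwap(const ITMHashEntry &hashEntry,
    bool nearVisibleFrame,    // block projects within the (tunable) boundary of the live visible frame
    bool hasHostCopy,         // block has data stored in host memory
    bool newlyIntegrated)     // block received new depth data this frame
{
    // swap in: host data exists, but the block is not resident on the device
    // (ptr == -1 marks a removed, i.e. swapped out, block)
    if (nearVisibleFrame && hasHostCopy && hashEntry.ptr == -1) return SWAP_IN;

    // swap out: the block is resident on the device, no longer needed for the
    // active view, and holds freshly integrated data worth saving
    if (!nearVisibleFrame && hashEntry.ptr >= 0 && newlyIntegrated) return SWAP_OUT;

    return SWAP_NONE;
}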

The swapping in stage is exemplified for a single block in the figure below. The indices of the hash entries that need to be swapped in are copied into the device transfer buffer, up to its capacity. Next, this is transferred to the host transfer buffer. There the indices are used as addresses inside the host voxel block array and the target blocks are copied to the host transfer buffer. Finally, the host transfer buffer is copied to the device where a single kernel integrates directly from the transfer buffer into the device voxel block memory.

An example for the swapping out stage is shown in the figure below for a single block. Both indices and voxel blocks that need to be swapped out are copied to the device transfer buffer. This is then copied to the host transfer buffer memory and again to host voxel memory.

All swapping-related variables and memory are kept inside the ITMGlobalCache object, and all swapping-related operations are performed by the ITMSwappingEngine.

Closely related projects

http://apt.cs.manchester.ac.uk/projects/PAMELA/

Comments

Chen

Hello Prof. Cheng! I am using an Intel RealSense D435i camera, which consists of a stereo depth module and an RGB module. The calibration parameters it needs differ quite a bit from what I see in RealSense2Engine.cpp. How should I write my calibration file? Thanks!

Saki

Same question here. I am currently working on this as well; could we discuss it together?

Natasha

Hello,

How can I call the standalone 3D reconstruction library? In my case I only want to reconstruct a particular region of the image, and I have the coordinates of that region. Is it possible to reconstruct just that region directly, or to obtain the voxels corresponding to those 2D coordinates?

Thanks!

杨明辉

Hello Prof. Cheng, I am running this program on an Android phone, currently with the TrackerConfig shown below, and I find that tracking takes about 38 seconds per frame. Is there any way to improve this? Thank you very much for your answer.
The phone is a MI MAX 3 with a Snapdragon 636 CPU.

杨明辉

Hello Prof. Cheng, thank you for your answer. I have a follow-up question.
If an Android phone wants to use GPU resources, does it have to support CUDA or Metal, i.e. does device_Type have to be DEVICE_CUDA?
That is my current understanding. If CUDA support is not strictly required, I have not quite figured out how to use the GPU. If you have time, I would appreciate some pointers; I will keep looking into it myself as well. Thank you very much.

杨明辉

Understood, thank you very much for your answer.

Demon

Hello, could this open-source library be modified to read a video stream in real time?

Timothy

Hello, when using live data, is it possible to save the data to disk at the same time?

王雪怡

Hello Prof. Cheng, when I run the InfiniTAM code I keep getting the error 1>UIEngine.obj : error LNK2019: unresolved external symbol __imp_glBegin. Do you know what causes it?

cnlnz

Does this program support the Kinect 1?

LEE

Hello Prof. Cheng, I tested several datasets and the results are good. But once InfiniTAM loses tracking, can it never recover? Or is something wrong with my settings? Looking forward to your reply.

刘永强

Prof. Cheng, I ran your InfiniTAM code, but I have a small problem: how do I load the whole image dataset into the project? At the moment only one image is processed, and pressing the 'n' key does not bring up the next image. How can I load all the images into the project?

肖泽东

I tried the sample program on Windows with VS2012. I slightly changed the way the file paths are passed in: instead of specifying them on the DOS command line, I added the concrete paths directly in the main function, and it worked without problems. At first I ran into a problem similar to yours; it may be caused by the '\\' path separator on Windows.
The main part of my modified code is as follows:
const char *calibFile = "E:\\InfiniTAM\\teddy_20141003\\Files\\Teddy\\calib.txt";
const char *imagesource_part1 = "E:\\InfiniTAM\\teddy_20141003\\Files\\Teddy\\Frames\\%04i.ppm";
const char *imagesource_part2 = "E:\\InfiniTAM\\teddy_20141003\\Files\\Teddy\\Frames\\%04i.pgm";

李航

What is your per-frame processing time? It feels too slow in my environment, with a GTX 770 GPU.

肖泽东

Mine is also quite slow, about 3 frames per second. My GPU is an AMD Radeon HD 7400.

李航

An AMD GPU cannot run CUDA acceleration. Could you leave your email address? I would like to discuss this with you.

肖泽东

I am not using CUDA; the program supports CPU-only processing. Happy to discuss and learn from each other. Email: xzd1575@gmail.com

初学者

Hello, where did you download the test image sets from?

肖泽东

Here to learn!

Gang

SLAM, very impressive; well done, Prof. Cheng. May I ask whether the generated model can be exported directly, or exported to other formats through an interface? Also, how well does the 3D reconstruction work for long-range (>10m) depth images? Thank you!

Gang

Thanks! Looking forward to a version that can export OBJ files, which would allow further experiments that use the 3D reconstruction results.

张佳伟

Is there now a version that can export OBJ files?

初学者

Hello Prof. Cheng. Where can I download the test image sets for this code?

初学者

Thank you!

guanshiyuan

Is there a version with colour?

刘

Hello Prof. Cheng, I would like to ask what this phrase refers to: number of resolution levels for the tracker.