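Notes on Using Faiss and Rapidsai_Raft

I have recently been running experiments on graph-based approximate vector search, which requires the Faiss library and the Raft library from the Rapids suite. Because I also need to collect statistics from inside the algorithms, I cannot use their prebuilt Python packages directly; I have to build both from source and call them from C++. These notes record the problems and tricks I ran into while compiling and running them. Raft has by far the most pitfalls.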

Faiss

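Compilation

Faiss is fairly easy to build: just follow the official tutorial. However, a build with the default cmake switches is relatively slow; to match the performance of the prebuilt Python package you need to set some cmake switches by hand. First install Intel's matrix library MKL by following Intel's installation instructions. My system is Ubuntu, so I chose the APT method: I added the repository as instructed and then ran apt install intel-oneapi-mkl-devel. The cmake switches I used are: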

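The first four have little to do with performance; they control whether to build GPU support, the Python bindings, the test code, and a shared library. -DFAISS_OPT_LEVEL=avx512 selects the level of vectorized instructions used for distance computation and depends on your CPU architecture. The server I use supports AVX512; you can check what your CPU supports with cat /proc/cpuinfo. AVX2 is supported on most machines, so -DFAISS_OPT_LEVEL=avx2 is usually a safe choice. After that a normal make works, with no pitfalls.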
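To make the CPU check above concrete, here is a minimal sketch that maps the flags line of /proc/cpuinfo to a FAISS_OPT_LEVEL value. The helper name pick_faiss_opt_level is hypothetical, and the set of AVX512 subsets it requires (F, CD, VL, DQ, BW) is my assumption about what the avx512 build needs, so check the Faiss CMake files for your version.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Faiss): given the "flags" line of
// /proc/cpuinfo, suggest a value for -DFAISS_OPT_LEVEL.
std::string pick_faiss_opt_level(const std::string& flags_line) {
    // Split the whitespace-separated flag list into a set for exact matching.
    std::istringstream iss(flags_line);
    std::set<std::string> flags;
    for (std::string tok; iss >> tok;) flags.insert(tok);

    // Assumption: the avx512 build needs all of these subsets.
    const char* avx512_needed[] = {"avx512f", "avx512cd", "avx512vl",
                                   "avx512dq", "avx512bw"};
    bool has_avx512 = true;
    for (const char* name : avx512_needed)
        if (!flags.count(name)) has_avx512 = false;

    if (has_avx512) return "avx512";
    if (flags.count("avx2")) return "avx2";
    return "generic";  // the default level, without AVX2/AVX512 code paths
}
```

Feeding it the flags line printed by cat /proc/cpuinfo gives the string to pass after -DFAISS_OPT_LEVEL=.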

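Usage

Here, -I points to the directory containing the faiss source code and -L points to the directory containing the built libfaiss.a; if you ran make install earlier you can omit both. -lfaiss_avx512 corresponds to the cmake switch chosen when building faiss: if you chose AVX512 you can link libfaiss_avx512.a here, and if you chose AVX2 you have to link libfaiss_avx2.a instead, i.e. pass -lfaiss_avx2. Whichever switch you choose, a version without AVX, libfaiss.a, is always built as well, so -lfaiss also works, although it is slower.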

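The third line holds the build flags for Intel MKL. The above is only my own choice; I recommend generating them automatically with Intel's oneMKL Link Line Advisor. Two things to note. First, under Select interface layer, be sure to pick C API with 32-bit integer, otherwise you may get errors. Second, set Select OpenMP library according to the OpenMP library you use: if you have libiomp installed you can select Intel's OpenMP library. I compared it with GNU's OpenMP on the Faiss algorithms I use and the performance was about the same, so either should be fine. Once generated, replace the third line above with the contents of Use this link line.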

Tips

Here are three tips; the first two are for the prebuilt Python package and the third is for the C++ library:

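#include <cstdio>
#include <faiss/IndexFlat.h>
#include <faiss/IndexHNSW.h>
#include <faiss/utils/distances.h>

using namespace faiss;

// Global counter of distance computations
long long total_dist;

// Essentially the same as the implementation in faiss/IndexFlat.cpp, with unused code removed and total_dist counting added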
struct FlatL2Dis : FlatCodesDistanceComputer {
    size_t d;
    idx_t nb;
    const float* q;
    const float* b;

    float distance_to_code(const uint8_t* code) final {
        total_dist++;
        return fvec_L2sqr(q, (float*)code, d);
    }
    float symmetric_dis(idx_t i, idx_t j) override {
        total_dist++;
        return fvec_L2sqr(b + j * d, b + i * d, d);
    }
    explicit FlatL2Dis(const IndexFlat& storage, const float* q = nullptr):
        FlatCodesDistanceComputer(storage.codes.data(), storage.code_size),
        d(storage.d), nb(storage.ntotal), q(q), b(storage.get_xb()) {}
    void set_query(const float* x) override {
        q = x;
    }
};

// IndexFlat is defined in the header faiss/IndexFlat.h, so we can inherit from it directly and override only get_FlatCodesDistanceComputer
struct IndexFlatMy: IndexFlat {
    explicit IndexFlatMy(idx_t d): IndexFlat(d, METRIC_L2) {}
    IndexFlatMy() {}
    FlatCodesDistanceComputer* get_FlatCodesDistanceComputer() const override {
        return new FlatL2Dis(*this);
    }
};

// ...

IndexFlatMy indexflatmy(d);
IndexHNSW index(&indexflatmy, 32);

total_dist = 0;
index.add(myn, traindata.data());
printf("%lld\n", total_dist);
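One caveat with a plain global counter: as far as I know, IndexHNSW::add inserts vectors from several OpenMP threads, so an unsynchronized total_dist++ may lose increments. Below is a minimal sketch of a race-free counter using std::atomic. The names counted_l2sqr and all_pairs_sum are hypothetical stand-ins rather than Faiss APIs, and the loop only imitates concurrent distance calls.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Atomic counter: increments from different threads are never lost.
std::atomic<long long> total_dist{0};

// Stand-in for a counted distance function (squared L2 distance).
float counted_l2sqr(const float* a, const float* b, std::size_t d) {
    // Relaxed ordering is enough here because only the final total matters.
    total_dist.fetch_add(1, std::memory_order_relaxed);
    float s = 0.f;
    for (std::size_t k = 0; k < d; k++) {
        float t = a[k] - b[k];
        s += t * t;
    }
    return s;
}

// Computes all n * n ordered pairwise distances and returns their sum.
// Runs in parallel when compiled with OpenMP, serially otherwise.
double all_pairs_sum(const std::vector<float>& x, std::size_t n, std::size_t d) {
    assert(x.size() == n * d);
    double sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : sum)
#endif
    for (long long i = 0; i < (long long)n; i++)
        for (std::size_t j = 0; j < n; j++)
            sum += counted_l2sqr(&x[i * d], &x[j * d], d);
    return sum;
}
```

With this counter, the value is read with total_dist.load() and cast to long long before being passed to printf. An alternative that avoids atomic traffic would be one counter per thread, summed at the end.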

Raft

I use Raft mainly for its GNND and CAGRA algorithms. It has an absurd number of pitfalls, and it took me several attempts to get it working.

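Compilation

Compilation is one big pitfall, because following the documentation as written does not produce a working build 😅. According to the documentation, Raft can be used in three ways: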

Running

Running is another big pitfall. Here is the code I use to run the two stages of CAGRA:

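#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
#include <raft/core/copy.cuh>
#include <raft/core/mdspan.hpp>
#include <raft/neighbors/cagra.cuh>

using namespace raft::neighbors;

// Reads a SIFT1M data file: each vector is stored as a 4-byte dimension followed by its components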
template<typename T> int vecs_read(const std::string &filename, std::vector<T> &out, long long cnt) {
    std::ifstream f(filename, std::ios::binary); if (!f) return -1;
    int dim; f.read((char *)&dim, 4); out.resize(cnt * dim); f.seekg(0, std::ios::beg);
    for (long long i = 0; i < cnt; i++) {
        f.seekg(4, std::ios::cur); f.read((char *)(out.data() + i * dim), dim * sizeof(T));
    }
    return dim;
}

int main(int argc, char** argv) {

    int myn = 1000000;
    std::vector<float> traindata;
    int d = vecs_read(argv[1], traindata, myn); printf("%d\n", d);
    raft::device_resources dev_resources;

    // Create a host matrix and fill it with the vectors read from disk, then create a device matrix and copy the host matrix into it
    auto host_dataset = raft::make_host_matrix<float, int64_t>(dev_resources, myn, d);
    for (int i = 0; i < myn; i++)
        for (int j = 0; j < d; j++)
            host_dataset(i, j) = traindata[i * d + j];
    auto dataset = raft::make_device_matrix<float, int64_t>(dev_resources, myn, d);
    raft::copy(dev_resources, dataset.view(), host_dataset.view());

    // Build the first-stage KNN graph of CAGRA with GNND
    experimental::nn_descent::index_params build_params;
    build_params.graph_degree = 64;
    auto knn_graph = raft::make_host_matrix<int64_t, int64_t>(myn, 64);
    cagra::build_knn_graph(dev_resources, raft::make_const_mdspan(dataset.view()), knn_graph.view(), build_params);

    // Second stage: graph optimization
    auto optimized_graph = raft::make_host_matrix<int64_t, int64_t>(myn, 32);
    cagra::optimize(dev_resources, knn_graph.view(), optimized_graph.view());

    return 0;
}
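For readers unfamiliar with the file layout that vecs_read assumes, here is a small self-contained sketch of the fvecs format: every vector is a 4-byte integer dimension followed by that many 4-byte floats. The function names write_fvecs and read_fvecs are my own, and unlike vecs_read this reader checks every read and fails if the file holds fewer vectors than requested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Writes row-major vectors of dimension dim in fvecs layout.
bool write_fvecs(const std::string& path, const std::vector<float>& data, int32_t dim) {
    assert(dim > 0 && data.size() % dim == 0);
    std::ofstream f(path, std::ios::binary);
    if (!f) return false;
    for (std::size_t i = 0; i < data.size(); i += dim) {
        // Each record starts with its own copy of the dimension.
        f.write(reinterpret_cast<const char*>(&dim), sizeof(dim));
        f.write(reinterpret_cast<const char*>(data.data() + i), dim * sizeof(float));
    }
    return static_cast<bool>(f);
}

// Reads cnt vectors into out (row-major); returns the dimension, or -1 on error.
int read_fvecs(const std::string& path, std::vector<float>& out, long long cnt) {
    std::ifstream f(path, std::ios::binary);
    if (!f) return -1;
    int32_t dim = 0;
    out.clear();
    for (long long i = 0; i < cnt; i++) {
        // Read the per-vector dimension header, then the components.
        if (!f.read(reinterpret_cast<char*>(&dim), sizeof(dim)) || dim <= 0) return -1;
        std::size_t old_size = out.size();
        out.resize(old_size + dim);
        if (!f.read(reinterpret_cast<char*>(out.data() + old_size), dim * sizeof(float)))
            return -1;
    }
    return dim;
}
```

Writing a handful of vectors with write_fvecs and reading them back is a quick way to test loading code before pointing it at the full SIFT1M files.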

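There are two pitfalls. The first is how to get the vectors you have read into device memory: I searched the sprawling documentation and found no mention of it anywhere. Whatever else a demo does, it should use a reasonably realistic vector dataset as its example, yet this one uses randomly generated data. The library even provides a dedicated API for generating random datasets, but none for interoperating with std::vector or plain arrays, which left me speechless 😅😅😅... It is possible that I simply failed to find the API, but as I said, if such an API exists, an example belongs in the most visible place (the README, say), ideally using a common benchmark such as SIFT1M and showing loading, building, and querying clearly. In the end I had no choice but to write a brute-force element-by-element copy modeled on an existing example. My guess is that host_matrix and device_matrix have an API that exposes the underlying pointer, so that data can be moved with memcpy or cudaMemcpy, since the Python package accepts pytorch or cupy tensors directly, but I could not find it on the C++ side. (Update: after reading the implementations of several algorithms in the library, this method appears to be .data_handle(), which returns a pointer to the element type, but I have not tested it yet.)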
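If .data_handle() does behave as guessed, the double loop in the code above could collapse into one bulk copy, because a row-major matrix stores element (i, j) at offset i * d + j, which is the same layout as traindata. The sketch below illustrates only that equivalence, using a hypothetical stand-in type RowMajorMatrix. It is not a Raft type, and whether raft::make_host_matrix is row-major by default and exposes its pointer this way is an assumption to verify against your Raft version.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a row-major host matrix; NOT a Raft type.
struct RowMajorMatrix {
    std::size_t rows, cols;
    std::vector<float> storage;
    RowMajorMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), storage(r * c) {}
    // Element (i, j) lives at offset i * cols + j.
    float& operator()(std::size_t i, std::size_t j) { return storage[i * cols + j]; }
    // Raw pointer to the first element, mirroring the guessed accessor name.
    float* data_handle() { return storage.data(); }
};

// Element-by-element copy, like the double loop in the original code.
void copy_elementwise(const std::vector<float>& src, RowMajorMatrix& dst) {
    assert(src.size() == dst.rows * dst.cols);
    for (std::size_t i = 0; i < dst.rows; i++)
        for (std::size_t j = 0; j < dst.cols; j++)
            dst(i, j) = src[i * dst.cols + j];
}

// Bulk copy through the raw pointer; valid because both sides are
// contiguous and row-major.
void copy_bulk(const std::vector<float>& src, RowMajorMatrix& dst) {
    assert(src.size() == dst.rows * dst.cols);
    std::copy(src.begin(), src.end(), dst.data_handle());
}
```

With Raft, the analogous call would presumably be std::copy(traindata.begin(), traindata.end(), host_dataset.data_handle()), which is as untested here as in the update above.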

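The second pitfall is the template type requirements of cagra::build_knn_graph and cagra::optimize; the slightest mismatch and compilation fails. From cpp/include/raft/neighbors/detail/cagra/cagra_build.cuh you can see that the two functions require the second template parameter of knn_graph and optimized_graph to be int64_t, and also require their first template parameters to be identical.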

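Summary

Most libraries, with Faiss as a typical example, build and run without much trouble: cmake .., install whatever is missing, then make -j, and at use time specify the paths and libraries with -I, -L, and -l. The troublesome ones are libraries like Raft, which combine header-only code, cuda, cmake plugins, and other pitfalls; together with poorly written documentation, that is what makes them so painful. When you hit a library like this, the best trick is to look for folders named testing, demo, or template in its tree, or for repositories with those names by the same developers; the cmakelists/makefile provided there nearly always works.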
