Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

SINDI

SINDI: per-term inverted lists grouped by window; only the lists matching the query’s non-zero terms are walked and accumulated into an n_candidate-sized heap

SINDI (Sparse INverted Dense Index) is VSAG’s index for sparse vectors — the kind produced by BM25, SPLADE, and other learned-sparse encoders. Unlike the dense indexes (HGraph, IVF), SINDI operates directly on term/value pairs and is the only VSAG index that accepts dtype: "sparse".

How it works

  1. Window-based inverted lists. Documents are grouped into fixed-size windows (window_size). Within each window, an inverted list per term maps a term id to the (doc_id, value) pairs that mention it.
  2. Optional pruning and quantization. During construction, doc_prune_ratio drops low-weight terms per document, and use_quantization compresses the term values to shrink memory further.
  3. Scoring. At query time, SINDI iterates the non-zero terms of the query, walks the corresponding inverted lists in each window, aggregates contributions into a max-heap of size n_candidate, and returns the top-k. When use_reorder is enabled, the candidates are re-scored against a high-precision flat copy.

Distance is returned as 1 - inner_product so results sort ascending as in the dense indexes.

Quick start

#include <vsag/vsag.h>

std::string params = R"({
    "dtype": "sparse",
    "metric_type": "ip",
    "dim": 1024,
    "index_param": {
        "term_id_limit": 30000,
        "window_size": 50000,
        "doc_prune_ratio": 0.0,
        "use_quantization": false,
        "use_reorder": false,
        "remap_term_ids": false
    }
})";
auto index = vsag::Factory::CreateIndex("sindi", params).value();

// Build a dataset of SparseVector.
auto base = vsag::Dataset::Make();
base->NumElements(n)
    ->SparseVectors(sparse_vectors)  // vsag::SparseVector*
    ->Ids(ids)
    ->Owner(false);
index->Build(base);

// Search.
auto query = vsag::Dataset::Make();
query->NumElements(1)->SparseVectors(&query_vec)->Owner(false);
auto result = index->KnnSearch(
    query, /*topk=*/10,
    R"({"sindi": {"n_candidate": 100}})").value();

Build parameters

Build-time parameters live under index_param. dtype must be "sparse" and metric_type must be "ip".

ParameterTypeDefaultDescription
dimint— (required)Maximum number of non-zero elements per sparse vector. Not the vocabulary size.
term_id_limitint1000000Upper bound on term id values (≥ max term id + 1).
window_sizeint50000Documents per window (range: 10 000 – 60 000).
doc_prune_ratiofloat0.0Fraction of lowest-weight terms dropped per doc at build time (0.0 – 0.9).
use_quantizationboolfalseQuantize stored term values to cut memory; when enabled, uses 8-bit scalar quantization (SQ8).
use_reorderboolfalseKeep a high-precision flat copy and rescore results (~2× memory).
remap_term_idsboolfalseRemap term IDs before indexing; useful when term IDs are sparse or have large gaps.
avg_doc_term_lengthint100Hint for memory estimation only.

dim vs term_id_limit. For the sparse vector {0:0.1, 2:0.5, 177:0.8}, dim is 3 (three non-zero entries) while term_id_limit must be ≥ 178 (largest term id + 1). Sizing term_id_limit to your vocabulary is the most common first-time mistake.

Search parameters

Search-time parameters live under the sindi sub-object:

ParameterTypeDefaultDescription
n_candidateint0Candidate heap size. When 0, defaults to SPARSE_AMPLIFICATION_FACTOR · topk (500×). If set, must satisfy 1 ≤ n_candidate ≤ SPARSE_AMPLIFICATION_FACTOR · topk.
query_prune_ratiofloat0.0Fraction of lowest-weight query terms skipped (0.0 – 0.9).
term_prune_ratiofloat0.0Fraction of term-list entries skipped (0.0 – 0.9).
use_term_lists_heap_insertbooltrueTerm-list-ordered heap insertion; usually faster.
auto result = index->KnnSearch(
    query, topk,
    R"({"sindi": {"n_candidate": 200, "query_prune_ratio": 0.1}})").value();

When to use SINDI

  • Sparse retrieval with BM25, SPLADE, uniCOIL, or similar learned-sparse encoders.
  • Hybrid dense+sparse pipelines where SINDI handles the sparse leg in parallel with HGraph / IVF for dense embeddings.
  • Memory-constrained deployments of sparse corpora (use_quantization: true roughly halves memory with a small recall loss; use_reorder: true trades memory for recall).

SINDI does not accept dense vectors and supports only inner-product similarity. Range search and id-based filtering are supported; see the example for usage.

Practical guidance

  • For Chinese corpora, we recommend encoding sparse vectors with BGE-M3. For English corpora, SPLADE is the more common default.
  • BGE-M3 can emit both sparse and dense vectors. Today SINDI handles the sparse leg, and VSAG plans to support fused sparse+dense scoring in a future release.
  • Sparse vectors are not a complete replacement for BM25 full-text retrieval. In practice, three-way recall with BM25 + sparse + dense usually outperforms any two-way combination.
  • At the index level, SINDI can also serve BM25-style scoring: use inverse document frequency as the query-side term weight, and use term-frequency-based weights as the document-side term value.

Common configurations

  1. Flat brute-force sparse index. Keep all non-zero terms in the inverted index (doc_prune_ratio: 0.0), disable the flat reranker (use_reorder: false), and disable quantization (use_quantization: false). This is the simplest high-recall baseline.
  2. Pruned high-accuracy index. Prune most low-weight terms during build (doc_prune_ratio: 0.4), keep the flat copy for reranking (use_reorder: true), and enable quantization to shrink inverted-list memory (use_quantization: true). This is a common balance between memory and recall.
  3. Very large sparse vocabularies. When term IDs are sparse within the uint32 range, such as hash-based tokenizers, external vocabulary IDs, or vocabularies with large gaps, enable remap_term_ids: true. This avoids managing many empty posting lists and helps stay below the term_id_limit ceiling.

See also