SINDI
SINDI (Sparse INverted Dense Index) is VSAG’s index for sparse
vectors — the kind produced by BM25, SPLADE, and other learned-sparse encoders.
Unlike the dense indexes (HGraph, IVF), SINDI operates directly on term/value
pairs and is the only VSAG index that accepts dtype: "sparse".
- Source: src/algorithm/sindi/
- Example: examples/cpp/109_index_sindi.cpp
How it works
- Window-based inverted lists. Documents are grouped into fixed-size windows (window_size). Within each window, an inverted list per term maps a term id to the (doc_id, value) pairs that mention it.
- Optional pruning and quantization. During construction, doc_prune_ratio drops low-weight terms per document, and use_quantization compresses the term values to shrink memory further.
- Scoring. At query time, SINDI iterates the non-zero terms of the query, walks the corresponding inverted lists in each window, aggregates contributions into a max-heap of size n_candidate, and returns the top-k. When use_reorder is enabled, the candidates are re-scored against a high-precision flat copy.
Distance is returned as 1 - inner_product so results sort ascending as in the
dense indexes.
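The flow described above can be illustrated with a small, self-contained toy. This is only a sketch of the general technique (windowed inverted lists, per-window score accumulation, a bounded max-heap, and distance reported as 1 - inner_product); it is not VSAG's implementation. All names (ToySparseVector, ToyPosting, ToyWindowedIndex) are hypothetical, and pruning, quantization, and reordering are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration; not VSAG's internal structures.
struct ToySparseVector {
    std::vector<uint32_t> term_ids;
    std::vector<float> values;
};

struct ToyPosting {
    uint32_t local_doc;  // document offset inside its window
    float value;         // term weight in that document
};

struct ToyWindowedIndex {
    size_t window_size;
    size_t num_docs = 0;
    // windows[w][term_id] holds the postings of documents in window w.
    std::vector<std::unordered_map<uint32_t, std::vector<ToyPosting>>> windows;

    explicit ToyWindowedIndex(size_t ws) : window_size(ws) {}

    void Add(const ToySparseVector& doc) {
        const size_t w = num_docs / window_size;
        if (w >= windows.size()) windows.emplace_back();
        const uint32_t local = static_cast<uint32_t>(num_docs % window_size);
        for (size_t i = 0; i < doc.term_ids.size(); ++i) {
            windows[w][doc.term_ids[i]].push_back({local, doc.values[i]});
        }
        ++num_docs;
    }

    // Returns up to k (distance, doc_id) pairs in ascending distance order,
    // where distance = 1 - inner_product.
    std::vector<std::pair<float, size_t>> Search(const ToySparseVector& q,
                                                 size_t k) const {
        // Max-heap on distance: the worst kept candidate sits on top.
        std::priority_queue<std::pair<float, size_t>> heap;
        for (size_t w = 0; w < windows.size(); ++w) {
            const size_t docs_in_window =
                std::min(window_size, num_docs - w * window_size);
            std::vector<float> scores(docs_in_window, 0.0f);
            // Walk only the inverted lists of the query's non-zero terms.
            for (size_t i = 0; i < q.term_ids.size(); ++i) {
                auto it = windows[w].find(q.term_ids[i]);
                if (it == windows[w].end()) continue;
                for (const ToyPosting& p : it->second) {
                    scores[p.local_doc] += q.values[i] * p.value;
                }
            }
            for (size_t d = 0; d < docs_in_window; ++d) {
                if (scores[d] == 0.0f) continue;  // treated as "no match" in this toy
                heap.emplace(1.0f - scores[d], w * window_size + d);
                if (heap.size() > k) heap.pop();
            }
        }
        std::vector<std::pair<float, size_t>> out;
        while (!heap.empty()) {
            out.push_back(heap.top());
            heap.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }
};
```

Keeping a dense score array per window is what makes the window size matter: the accumulator stays small enough to remain cache-friendly, whatever the total corpus size.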
Quick start
```cpp
#include <string>

#include <vsag/vsag.h>

std::string params = R"({
    "dtype": "sparse",
    "metric_type": "ip",
    "dim": 1024,
    "index_param": {
        "term_id_limit": 30000,
        "window_size": 50000,
        "doc_prune_ratio": 0.0,
        "use_quantization": false,
        "use_reorder": false,
        "remap_term_ids": false
    }
})";
auto index = vsag::Factory::CreateIndex("sindi", params).value();

// Build a dataset of SparseVector.
// n, sparse_vectors, and ids are assumed to be prepared by the caller.
auto base = vsag::Dataset::Make();
base->NumElements(n)
    ->SparseVectors(sparse_vectors)  // vsag::SparseVector*
    ->Ids(ids)
    ->Owner(false);
index->Build(base);

// Search. query_vec is a single vsag::SparseVector owned by the caller.
auto query = vsag::Dataset::Make();
query->NumElements(1)->SparseVectors(&query_vec)->Owner(false);
auto result = index->KnnSearch(
    query, /*topk=*/10,
    R"({"sindi": {"n_candidate": 100}})").value();
```
Build parameters
Build-time parameters live under index_param. dtype must be "sparse"
and metric_type must be "ip".
| Parameter | Type | Default | Description |
|---|---|---|---|
| dim | int | (required) | Maximum number of non-zero elements per sparse vector. Not the vocabulary size. |
| term_id_limit | int | 1000000 | Upper bound on term id values (≥ max term id + 1). |
| window_size | int | 50000 | Documents per window (range: 10 000 – 60 000). |
| doc_prune_ratio | float | 0.0 | Fraction of lowest-weight terms dropped per doc at build time (0.0 – 0.9). |
| use_quantization | bool | false | Quantize stored term values to cut memory; when enabled, uses 8-bit scalar quantization (SQ8). |
| use_reorder | bool | false | Keep a high-precision flat copy and rescore results (~2× memory). |
| remap_term_ids | bool | false | Remap term IDs before indexing; useful when term IDs are sparse or have large gaps. |
| avg_doc_term_length | int | 100 | Hint for memory estimation only. |
dim vs term_id_limit. For the sparse vector {0:0.1, 2:0.5, 177:0.8}, dim is 3 (three non-zero entries) while term_id_limit must be ≥ 178 (largest term id + 1). Mis-sizing term_id_limit for your vocabulary is the most common first-time mistake.
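One way to avoid mixing up the two values is to derive both from the corpus before creating the index. The helper below is a minimal sketch under that assumption; ToySparseVector, SizingHint, and ComputeSizing are hypothetical names for illustration and are not part of VSAG.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a sparse vector; not VSAG's type.
struct ToySparseVector {
    std::vector<uint32_t> term_ids;
    std::vector<float> values;
};

struct SizingHint {
    int64_t dim;            // largest number of non-zero entries in any vector
    int64_t term_id_limit;  // largest term id seen, plus one
};

SizingHint ComputeSizing(const std::vector<ToySparseVector>& corpus) {
    SizingHint hint{0, 0};
    for (const ToySparseVector& v : corpus) {
        hint.dim = std::max<int64_t>(hint.dim,
                                     static_cast<int64_t>(v.term_ids.size()));
        for (uint32_t t : v.term_ids) {
            hint.term_id_limit =
                std::max<int64_t>(hint.term_id_limit, static_cast<int64_t>(t) + 1);
        }
    }
    return hint;
}
```

For the vector {0:0.1, 2:0.5, 177:0.8} alone, this yields dim = 3 and term_id_limit = 178. In practice you would likely round both up to leave headroom for documents added later.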
Search parameters
Search-time parameters live under the sindi sub-object:
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_candidate | int | 0 | Candidate heap size. When 0, defaults to SPARSE_AMPLIFICATION_FACTOR · topk (the factor is 500). If set, must satisfy 1 ≤ n_candidate ≤ SPARSE_AMPLIFICATION_FACTOR · topk. |
| query_prune_ratio | float | 0.0 | Fraction of lowest-weight query terms skipped (0.0 – 0.9). |
| term_prune_ratio | float | 0.0 | Fraction of term-list entries skipped (0.0 – 0.9). |
| use_term_lists_heap_insert | bool | true | Term-list-ordered heap insertion; usually faster. |
```cpp
auto result = index->KnnSearch(
    query, topk,
    R"({"sindi": {"n_candidate": 200, "query_prune_ratio": 0.1}})").value();
```
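To make the n_candidate rule and the effect of query_prune_ratio concrete, here is a small sketch. It takes the amplification factor of 500 from the table above. The rounding used when pruning (dropping floor(ratio · n) terms) is an assumption, since the exact rule VSAG applies is not stated here. ResolveNCandidate, PruneLowestWeights, and TermWeight are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Factor quoted in the parameter table; the constant name is illustrative.
constexpr int64_t kAmplificationFactor = 500;

// Writes the effective heap size to *resolved and returns true, or returns
// false when an explicit n_candidate falls outside [1, factor * topk].
bool ResolveNCandidate(int64_t n_candidate, int64_t topk, int64_t* resolved) {
    const int64_t upper = kAmplificationFactor * topk;
    if (n_candidate == 0) {
        *resolved = upper;  // 0 means "use the default"
        return true;
    }
    if (n_candidate < 1 || n_candidate > upper) return false;
    *resolved = n_candidate;
    return true;
}

using TermWeight = std::pair<uint32_t, float>;  // (term id, weight)

// Keeps the highest-weight terms and drops the lowest-weight fraction.
// Assumed rounding: floor(prune_ratio * n) terms are dropped.
std::vector<TermWeight> PruneLowestWeights(std::vector<TermWeight> terms,
                                           float prune_ratio) {
    size_t drop = static_cast<size_t>(
        std::floor(prune_ratio * static_cast<float>(terms.size())));
    drop = std::min(drop, terms.size());
    std::sort(terms.begin(), terms.end(),
              [](const TermWeight& a, const TermWeight& b) {
                  return a.second > b.second;  // descending by weight
              });
    terms.resize(terms.size() - drop);
    return terms;
}
```

With topk = 10, an n_candidate of 0 resolves to 5000, 200 is accepted as given, and 5001 is rejected. Pruning shortens the list of inverted lists that have to be walked, which is where the speed-up for long queries comes from.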
When to use SINDI
- Sparse retrieval with BM25, SPLADE, uniCOIL, or similar learned-sparse encoders.
- Hybrid dense+sparse pipelines where SINDI handles the sparse leg in parallel with HGraph / IVF for dense embeddings.
- Memory-constrained deployments of sparse corpora (use_quantization: true roughly halves memory with a small recall loss; use_reorder: true trades memory for recall).
SINDI does not accept dense vectors and supports only inner-product similarity. Range search and id-based filtering are supported; see the example for usage.
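The memory and recall trade-off of use_quantization comes from storing each term value in one byte instead of a 32-bit float. The sketch below shows generic 8-bit scalar quantization with a single min/step pair. It illustrates the technique only and assumes nothing about how VSAG lays out or trains its SQ8 codes. ToySq8 is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Generic 8-bit scalar quantizer; an illustration, not VSAG's SQ8 code.
struct ToySq8 {
    float min_val;
    float step;  // width of one quantization bucket

    // Learns the value range. `values` must be non-empty.
    static ToySq8 Train(const std::vector<float>& values) {
        auto mm = std::minmax_element(values.begin(), values.end());
        const float lo = *mm.first;
        const float hi = *mm.second;
        const float step = (hi > lo) ? (hi - lo) / 255.0f : 1.0f;
        return ToySq8{lo, step};
    }

    // Maps a float to the nearest of 256 levels, clamping out-of-range input.
    uint8_t Encode(float x) const {
        const float code = std::round((x - min_val) / step);
        return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, code)));
    }

    float Decode(uint8_t code) const {
        return min_val + step * static_cast<float>(code);
    }
};
```

For in-range values the round-trip error is at most about half a step, which is why the recall loss tends to be small. Enabling use_reorder alongside it rescores the final candidates against unquantized values.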
Practical guidance
- For Chinese corpora, we recommend encoding sparse vectors with BGE-M3. For English corpora, SPLADE is the more common default.
- BGE-M3 can emit both sparse and dense vectors. Today SINDI handles the sparse leg, and VSAG plans to support fused sparse+dense scoring in a future release.
- Sparse vectors are not a complete replacement for BM25 full-text retrieval. In practice, three-way recall with BM25 + sparse + dense usually outperforms any two-way combination.
- At the index level, SINDI can also serve BM25-style scoring: use inverse document frequency as the query-side term weight, and use term-frequency-based weights as the document-side term value.
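The BM25-style split mentioned in the last point can be sketched as follows. The formulas are the common Robertson/Lucene-style BM25 variant with k1 = 1.2 and b = 0.75, chosen here as an assumption; the text above does not prescribe a specific variant. Bm25Idf and Bm25TermValue are hypothetical helper names. The inner product of the query weights and the document values over shared terms then equals the BM25 score.

```cpp
#include <cmath>

// Query-side weight: inverse document frequency of a term.
// num_docs is the corpus size, doc_freq the number of documents containing it.
double Bm25Idf(double num_docs, double doc_freq) {
    return std::log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5));
}

// Document-side value: saturated, length-normalized term frequency.
// tf is the term's count in the document, doc_len the document length,
// avg_doc_len the mean document length in the corpus.
double Bm25TermValue(double tf, double doc_len, double avg_doc_len,
                     double k1 = 1.2, double b = 0.75) {
    const double norm = k1 * (1.0 - b + b * doc_len / avg_doc_len);
    return tf * (k1 + 1.0) / (tf + norm);
}
```

Document values depend only on the document and corpus statistics, so they can be computed once at build time. IDF weights are attached to the query terms at search time.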
Common configurations
- Flat brute-force sparse index. Keep all non-zero terms in the inverted index (doc_prune_ratio: 0.0), disable the flat reranker (use_reorder: false), and disable quantization (use_quantization: false). This is the simplest high-recall baseline.
- Pruned high-accuracy index. Prune the lowest-weight 40% of terms per document during build (doc_prune_ratio: 0.4), keep the flat copy for reranking (use_reorder: true), and enable quantization to shrink inverted-list memory (use_quantization: true). This is a common balance between memory and recall.
- Very large sparse vocabularies. When term IDs are sparse within the uint32 range, such as hash-based tokenizers, external vocabulary IDs, or vocabularies with large gaps, enable remap_term_ids: true. This avoids managing many empty posting lists and helps stay below the term_id_limit ceiling.
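The idea behind remapping for very large sparse vocabularies can be shown with a tiny dictionary that assigns contiguous ids in order of first appearance. This is a sketch of the general technique only. How VSAG implements remap_term_ids internally is not described here, and ToyTermRemapper is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Maps arbitrary uint32 term ids onto 0, 1, 2, ... in first-seen order.
class ToyTermRemapper {
public:
    // Returns the dense id for `original`, assigning the next free id
    // the first time a term is seen.
    uint32_t Remap(uint32_t original) {
        auto it = dense_.find(original);
        if (it != dense_.end()) return it->second;
        const uint32_t next = static_cast<uint32_t>(dense_.size());
        dense_.emplace(original, next);
        return next;
    }

    // Number of distinct terms seen so far.
    size_t Size() const { return dense_.size(); }

private:
    std::unordered_map<uint32_t, uint32_t> dense_;
};
```

After remapping, the number of posting lists tracks the number of distinct terms actually present, not the largest raw id, so a hashed id such as 4000000000 no longer forces a huge id range.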