Citus

Citus is a PostgreSQL extension that turns Postgres into a distributed database. PGVecto.rs works well with Citus. Here are the steps to enable vector search in a Citus-managed PostgreSQL instance.

Citus can run on a Single-Node or Multi-Node setup, both of which are natively compatible with PGVecto.rs.

Single-Node Citus

First, prepare a Dockerfile that installs PGVecto.rs and Citus and loads them into PostgreSQL.

TIP

Due to a restriction of Citus, citus must be the first entry in shared_preload_libraries.

dockerfile
FROM tensorchord/pgvecto-rs-binary:pg16-v0.3.0-amd64 as binary

FROM postgres:16
COPY --from=binary /pgvecto-rs-binary-release.deb /tmp/vectors.deb
RUN apt-get update && apt-get install -y curl
RUN apt-get install -y /tmp/vectors.deb && rm -f /tmp/vectors.deb
RUN curl https://install.citusdata.com/community/deb.sh -o deb.sh
RUN bash deb.sh 
RUN apt-get install -y postgresql-16-citus-12.1

CMD ["postgres", "-c" ,"shared_preload_libraries=citus,vectors.so", "-c", "search_path=\"$user\", public, vectors"]

We can build and run a Docker image called citus-vector:16 locally.

WARNING

Don't use POSTGRES_HOST_AUTH_METHOD: "trust" in production; see the Citus notes about Increasing Worker Security for detailed authentication configuration.

shell
docker build -t citus-vector:16 .
docker run --name citus-single -e POSTGRES_HOST_AUTH_METHOD=trust -p 5432:5432 -d citus-vector:16

Once the container is running, we can connect to it with psql and enable the essential extensions.

sql
-- psql -h 127.0.0.1 -d postgres -U postgres -w
CREATE EXTENSION IF NOT EXISTS citus;
CREATE EXTENSION IF NOT EXISTS vectors;
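
To double-check the load order required by Citus (see the tip above), you can inspect shared_preload_libraries from the same session. This is just a sanity check; the expected value mirrors the CMD in the Dockerfile:

sql
SHOW shared_preload_libraries;
-- citus must come first, e.g.: citus,vectors.so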

Once the extensions are created, you can create a simple table, then transform it into a distributed table.

sql
CREATE TABLE items (id bigserial, embedding vector(3), category_id bigint, PRIMARY KEY (id, category_id));
SET citus.shard_count = 4;
SELECT create_distributed_table('items', 'category_id');
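
If you want to verify where the shards ended up, Citus (10.0 and later) provides the citus_shards view; on a single node, all shards live on the coordinator itself:

sql
-- List the shards backing the distributed items table.
SELECT shard_name, nodename, nodeport
FROM citus_shards
WHERE table_name = 'items'::regclass;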

Index builds and vector searches work the same as before. Tables and indexes are distributed across all shards/workers automatically.

sql
INSERT INTO items (embedding, category_id)
SELECT '[1, 1, 1]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000)
UNION ALL
SELECT '[1, 2, 3]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000)
UNION ALL
SELECT '[3, 2, 1]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000);
CREATE INDEX ON items USING vectors (embedding vector_cos_ops) WITH (options = "[indexing.hnsw]");
SELECT id FROM items ORDER BY embedding <=> '[1, 1, 1]' LIMIT 100;
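
To confirm that the search goes through the distributed planner, you can look at the query plan. The exact output depends on your Citus version, but it should contain a Custom Scan (Citus Adaptive) node that fans the query out to the shards:

sql
EXPLAIN SELECT id FROM items ORDER BY embedding <=> '[1, 1, 1]' LIMIT 100;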

Multi-Node Citus

You can use the same Dockerfile and image for Multi-Node Citus as for Single-Node Citus. In this example, we will start a cluster of 1 coordinator node and 2 worker nodes with Docker Compose.

WARNING

Don't use POSTGRES_HOST_AUTH_METHOD: "trust" in production; see the Citus notes about Increasing Worker Security for detailed authentication configuration.

yaml
services:
  main:
    container_name: main
    image: 'citus-vector:16'
    ports:
      - "5432:5432"
    environment:
      POSTGRES_HOST_AUTH_METHOD: "trust"
  worker1:
    container_name: worker1
    image: 'citus-vector:16'
    ports:
      - "5431:5432"
    environment:
      POSTGRES_HOST_AUTH_METHOD: "trust"
  worker2:
    container_name: worker2
    image: 'citus-vector:16'
    ports:
      - "5430:5432"
    environment:
      POSTGRES_HOST_AUTH_METHOD: "trust"

Now start all containers with Docker Compose. Each container is a Citus node: 1 coordinator (main) and 2 workers.

shell
docker compose up -d

Once all containers are running, we can enable the essential extensions on all nodes.

shell
psql -h 127.0.0.1 -d postgres -U postgres -w -c 'CREATE EXTENSION IF NOT EXISTS citus;CREATE EXTENSION IF NOT EXISTS vectors;'
psql -h 127.0.0.1 -d postgres -U postgres -w -p 5431 -c 'CREATE EXTENSION IF NOT EXISTS citus;CREATE EXTENSION IF NOT EXISTS vectors;'
psql -h 127.0.0.1 -d postgres -U postgres -w -p 5430 -c 'CREATE EXTENSION IF NOT EXISTS citus;CREATE EXTENSION IF NOT EXISTS vectors;'
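
As a sanity check, you can list the installed extensions on any of the nodes (shown here for the coordinator on port 5432):

sql
-- Run via: psql -h 127.0.0.1 -d postgres -U postgres -w
SELECT extname, extversion FROM pg_extension
WHERE extname IN ('citus', 'vectors');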

To inform the coordinator about its workers, connect to the coordinator node with psql and register the two workers.

sql
-- psql -h 127.0.0.1 -d postgres -U postgres -w
SELECT citus_set_coordinator_host('main', 5432);
SELECT * FROM citus_add_node('worker1', 5432);
SELECT * FROM citus_add_node('worker2', 5432);
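
You can verify that both workers were registered by asking the coordinator for its active worker nodes:

sql
SELECT * FROM citus_get_active_worker_nodes();
-- Expect two rows: worker1/5432 and worker2/5432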

Finally, you can perform vector queries in the same way as in Single-Node Citus.

sql
CREATE TABLE items (id bigserial, embedding vector(3), category_id bigint, PRIMARY KEY (id, category_id));
SET citus.shard_count = 4;
SELECT create_distributed_table('items', 'category_id');
INSERT INTO items (embedding, category_id) 
SELECT '[1, 1, 1]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000)
UNION ALL
SELECT '[1, 2, 3]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000)
UNION ALL
SELECT '[3, 2, 1]'::vector, FLOOR(random() * 100) FROM generate_series(1, 1000);
CREATE INDEX ON items USING vectors (embedding vector_cos_ops) WITH (options = "[indexing.hnsw]");
SELECT id FROM items ORDER BY embedding <=> '[1, 1, 1]' LIMIT 100;
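
Because the index is built on the worker shards rather than on the coordinator, you can confirm its shard copies exist with run_command_on_workers. This is a minimal sketch assuming the auto-generated index name items_embedding_idx; adjust the pattern if you named the index explicitly:

sql
-- Count the per-shard copies of the vector index on each worker.
SELECT nodename, result FROM run_command_on_workers(
  $$ SELECT count(*) FROM pg_indexes WHERE indexname LIKE 'items_embedding_idx%' $$
);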

Monitor

If you're using a Citus distributed table, the pg_vector_index_stat view on the coordinator will be empty. Since the indexes are actually created on the workers, the index status can only be inspected by running the query against the workers.

sql
SELECT * FROM run_command_on_workers($$ SELECT (to_json(array_agg(pg_vector_index_stat.*))) FROM pg_vector_index_stat $$);