Recently I learned about a new kind of search called vector search, or semantic search. This technique tries to find documents that match the meaning of the user's search term, instead of matching keywords the way a full-text search (FTS) does.
I wanted to try Semantic Search for my blog. I came across Alex Garcia’s post about a new SQLite extension for Vector Search called sqlite-vss. Since my blog data is already in a SQLite database I figured, why not?
The idea behind semantic search is to encode the contents of each document into a vector of floating-point numbers called an embedding, then use a cosine-similarity algorithm to match search terms with documents. Calculating the embeddings requires a Python library called sentence-transformers, which can be installed with pip:
$ pip install 'torch<2' sentence-transformers
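The cosine-similarity measure itself is easy to compute with numpy. Here is a minimal sketch using toy 3-dimensional vectors in place of the model's real 384-dimensional embeddings (the helper function is mine, not part of any library):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
doc = np.array([0.2, 0.7, 0.1])
query = np.array([0.25, 0.6, 0.2])

print(cosine_similarity(doc, query))  # close to 1.0 for similar vectors
```

A score near 1.0 means the vectors point in nearly the same direction, i.e. the texts are semantically similar.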
I used the trusty sqlite-utils to add the embeddings to my database as new columns. The CLI has a convert sub-command that runs a Python function on each row of a table and writes the results into a different column. I wrote a Python function that calculates the embeddings and returns them as bytes. The results are written into a new column called title_embeddings of type blob.
First let's run the embeddings on the title column:
$ sqlite-utils convert blog.db posts title '
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def convert(value):
    return model.encode(value).tobytes()
' \
--output title_embeddings \
--output-type blob
Next is the mdbody column, to calculate the embeddings of each post's body:
$ sqlite-utils convert blog.db posts mdbody '
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def convert(value):
    return model.encode(value).tobytes()
' \
--output body_embeddings \
--output-type blob
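As a sanity check on what those blob columns hold: model.encode returns a float32 numpy array, and .tobytes() serializes it, so it can be decoded back with np.frombuffer. A sketch with a toy vector standing in for a real 384-dimension embedding:

```python
import numpy as np

# Stand-in for model.encode(...): a small float32 vector.
embedding = np.array([0.1, -0.2, 0.3], dtype=np.float32)

# What sqlite-utils stored in the blob column:
blob = embedding.tobytes()

# Decoding the blob back into a vector:
restored = np.frombuffer(blob, dtype=np.float32)

assert np.array_equal(embedding, restored)
```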
Now we enable the sqlite-vss extension and use it to build an index.
I’m going to use my favorite CLI for SQLite called litecli.
$ litecli blog.db
The two .so files that we downloaded from the sqlite-vss GitHub releases page are loaded into the database:
sqlite> .load ./vector0
sqlite> .load ./vss0
Using the vss0 extension we create a table called posts_vss that will hold the index:
sqlite> CREATE VIRTUAL TABLE posts_vss
USING vss0(title_embedding(384), body_embedding(384));
Next we insert the data from the posts table into the posts_vss table:
sqlite> INSERT INTO posts_vss (rowid, title_embedding, body_embedding)
SELECT rowid, title_embeddings, body_embeddings FROM posts;
Optionally, we can create a trigger that will keep the posts_vss table in sync with the posts table:
sqlite> CREATE TRIGGER posts_vss_ai AFTER INSERT ON posts
BEGIN
    INSERT INTO posts_vss (rowid, title_embedding, body_embedding)
    VALUES (new.rowid, new.title_embeddings, new.body_embeddings);
END;
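The trigger mechanics can be tried with Python's built-in sqlite3 module and plain tables, no extension required. The table and column names below are simplified stand-ins for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (title TEXT, title_embeddings BLOB)")
db.execute("CREATE TABLE posts_shadow (rowid_ref INTEGER, title_embedding BLOB)")

# Same AFTER INSERT pattern as the posts_vss trigger above.
db.execute("""
    CREATE TRIGGER posts_ai AFTER INSERT ON posts
    BEGIN
        INSERT INTO posts_shadow (rowid_ref, title_embedding)
        VALUES (new.rowid, new.title_embeddings);
    END
""")

db.execute("INSERT INTO posts (title, title_embeddings) VALUES (?, ?)",
           ("hello", b"\x00\x01"))
rows = db.execute("SELECT rowid_ref, title_embedding FROM posts_shadow").fetchall()
print(rows)  # the shadow table picked up the new row automatically
```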
We are ready to search using the vector search technique. When the user types in a query, we will create embeddings of the user input using the same encoding algorithm we used for the title and body.
# vector_search.py
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = input("Enter search term: ")
query_embedding = model.encode(query).tolist()
Using the embeddings of the user input we can search the posts_vss table for the closest matches. I decided to run the query from Python, since encoding the search term has to happen in Python anyway. First, pip install the sqlite-vss library.
# vector_search.py
import sqlite3
import sqlite_vss
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = input("Enter search term: ")
query_embedding = model.encode(query).tolist()
db = sqlite3.connect("blog.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
stmt = f"""
with body_matches as (
select rowid from posts_vss where vss_search(body_embedding, '{query_embedding}')
limit 5
),
title_matches as (
select rowid from posts_vss where vss_search(title_embedding, '{query_embedding}')
limit 5
)
select distinct posts.id, posts.url, posts.title
from body_matches, title_matches
left join posts on posts.rowid = body_matches.rowid or posts.rowid = title_matches.rowid
"""
results = db.execute(stmt)
print(list(results))
This searches both the title and the body for the five closest matches each and returns the combined, de-duplicated results. Here is a sample output:
$ python vector_search.py
Enter search term: lemon
[
(134, 'http://blog.amjith.com/the-lemonade-stand', 'The Lemonade Stand'),
(116, 'http://blog.amjith.com/orange-juice-with-p-star-star-p', 'Orange Juice with p**p'),
(190, 'https://blog.amjith.com/orange', 'Orange?'),
(35, 'http://blog.amjith.com/shenanigans', 'Shenanigans'),
(118, 'http://blog.amjith.com/chocolate-juice', 'Chocolate Juice'),
(49, 'http://blog.amjith.com/conversations-with-a-4-year-old', 'Conversations with a 4 year old'),
(158, 'http://blog.amjith.com/dinner-and-bsg', 'Dinner and BSG')
]
The results are pretty good.
Datasette
How do we get this to work with Datasette? Datasette has a plugin system that allows us to extend its functionality. The author of sqlite-vss has created a Datasette plugin called datasette-sqlite-vss, which loads the sqlite-vss extension into the SQLite database when Datasette starts.
datasette install datasette-sqlite-vss
The plugin also adds a new SQL function called vss_search. Once the plugin is installed, it is loaded when Datasette starts, and we can use vss_search to query the index.
We are still missing a piece: how do we get the user input from the search box into the SQL query? Datasette's plugin system comes to the rescue again. I wrote a small plugin that converts a user's input string into embeddings using SentenceTransformer.
# vector_encode.py
import json
from datasette import hookimpl
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

@hookimpl
def prepare_connection(conn):
    conn.create_function("vector_encode", 1, vector_encode)

def vector_encode(term):
    embeddings = model.encode(term)
    return json.dumps(embeddings.tolist())
The plugin creates a new SQL function called vector_encode that can be used to encode a string into a vector. Save this in a Python file called vector_encode.py in a folder called plugins.
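Outside Datasette, conn.create_function from the standard sqlite3 module is the same mechanism prepare_connection uses. Here is a sketch with a stub encoder standing in for the real model, so the pattern is testable without downloading anything:

```python
import json
import sqlite3

def fake_encode(term):
    # Stub standing in for model.encode(term).tolist(); returns a tiny fixed-size vector.
    return [float(len(term)), float(term.count(" "))]

def vector_encode(term):
    return json.dumps(fake_encode(term))

conn = sqlite3.connect(":memory:")
# This is what the prepare_connection hook does on every Datasette connection.
conn.create_function("vector_encode", 1, vector_encode)

(result,) = conn.execute("SELECT vector_encode('hello world')").fetchone()
print(result)  # a JSON array, the same shape vss_search expects as a query
```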
datasette blog.db --plugins-dir=plugins/
Now we can use the vector_encode function to encode the user input and the vss_search function to search the index. Here is the SQL query that does the search:
with body_matches as (
select rowid from posts_vss where vss_search(body_embedding, vector_encode(:term))
limit 5
),
title_matches as (
select rowid from posts_vss where vss_search(title_embedding, vector_encode(:term))
limit 5
)
select distinct posts.id, posts.url, posts.title from body_matches, title_matches
left join posts on posts.rowid = body_matches.rowid
or posts.rowid = title_matches.rowid
Visit http://localhost:8001/blog/posts, paste the query into the SQL editor, and click Run SQL. You should see an input box that lets you type in the search term.
I would go a step further and use the canned-query feature in Datasette to make this slightly easier. Create a metadata.yml file:
databases:
  blog:
    queries:
      vector_search:
        sql: |-
          with body_matches as (
            select rowid from posts_vss where vss_search(body_embedding, vector_encode(:term))
            limit 5
          ),
          title_matches as (
            select rowid from posts_vss where vss_search(title_embedding, vector_encode(:term))
            limit 5
          )
          select distinct posts.id, posts.url, posts.title
          from body_matches, title_matches
          left join posts on posts.rowid = body_matches.rowid or posts.rowid = title_matches.rowid
        title: Vector Search
Then relaunch datasette with the metadata file.
datasette blog.db --metadata=metadata.yml --plugins-dir=plugins/
Visit http://localhost:8001/blog and click on the Vector Search query. You should see an input box that lets you type in the search term.
Finally publish it to fly.io using the datasette-publish-fly plugin.
datasette publish fly blog.db --plugins-dir=plugins/ --metadata=metadata.yml \
--app=blog-vector-search \
--install=datasette-sqlite-vss \
--install="'torch<2'" \
--install=sentence-transformers
The additional --install flags are needed to install the dependencies for the plugin that we created to encode the search term.
Unfortunately this does not fit in the free-tier fly.io instances, so I don't have a demo version to show you. But trust me, it is awesome.
Thank you, Alex Garcia and Simon Willison for making these cool projects and writing about them in detail.