Recently I learned about a new kind of search called vector search, or semantic search. This technique tries to find documents that match the meaning of the user's search term, instead of matching keywords the way a full-text search (FTS) does.
I wanted to try Semantic Search for my blog. I came across Alex Garcia’s post about a new SQLite extension for Vector Search called sqlite-vss. Since my blog data is already in a SQLite database I figured, why not?
The idea behind semantic search is to encode the contents of each document into a vector of floating-point numbers called an embedding, then use a cosine-similarity algorithm to match search terms with documents. Calculating the embeddings requires a Python library called sentence-transformers, which can be installed with pip:
$ pip install 'torch<2' sentence-transformers
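The cosine-similarity measure itself is easy to compute with numpy. Here is a minimal sketch using toy 3-dimensional vectors in place of the model's real 384-dimensional embeddings (the helper function is mine, not part of any library):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
doc = np.array([0.2, 0.7, 0.1])
query = np.array([0.25, 0.6, 0.2])

print(cosine_similarity(doc, query))  # close to 1.0 for similar vectors
```

A score near 1.0 means the vectors point in nearly the same direction, i.e. the texts are semantically similar.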
I used the trusty sqlite-utils to add the embeddings to my database as new columns. The CLI has a convert sub-command that runs a Python function on each row of a table and writes the results into a different column. I wrote a Python function that calculates the embeddings and returns them as bytes. The results are written into a new column called title_embeddings of type blob.
First let's run the embeddings on the title column:
$ sqlite-utils convert blog.db posts title '
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def convert(value):
    return model.encode(value).tobytes()
' \
--output title_embeddings \
--output-type blob
Next is the mdbody column, to calculate the embeddings of each post's body:
$ sqlite-utils convert blog.db posts mdbody '
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
def convert(value):
    return model.encode(value).tobytes()
' \
--output body_embeddings \
--output-type blob
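As a sanity check on what those blob columns hold: model.encode returns a float32 numpy array, and .tobytes() serializes it, so it can be decoded back with np.frombuffer. A sketch with a toy vector standing in for a real 384-dimension embedding:

```python
import numpy as np

# Stand-in for model.encode(...): a small float32 vector.
embedding = np.array([0.1, -0.2, 0.3], dtype=np.float32)

# What sqlite-utils stored in the blob column:
blob = embedding.tobytes()

# Decoding the blob back into a vector:
restored = np.frombuffer(blob, dtype=np.float32)

assert np.array_equal(embedding, restored)
```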
Now we enable the sqlite-vss extension and use it to build an index.
I’m going to use my favorite CLI for SQLite called litecli.
$ litecli blog.db
The two .so files that we downloaded from the sqlite-vss GitHub releases page are loaded into the database:
sqlite> .load ./vector0
sqlite> .load ./vss0
Using the vss0 extension we create a table called posts_vss that will hold the index:
sqlite> CREATE VIRTUAL TABLE posts_vss
USING vss0(title_embedding(384), body_embedding(384));
Next we insert the data from the posts table into the posts_vss table:
sqlite> INSERT INTO posts_vss (rowid, title_embedding, body_embedding)
SELECT rowid, title_embeddings, body_embeddings FROM posts;
Optionally, we can create a trigger that will keep the posts_vss table in sync with the posts table:
sqlite> CREATE TRIGGER posts_vss_ai AFTER INSERT ON posts
BEGIN
    INSERT INTO posts_vss (rowid, title_embedding, body_embedding)
    VALUES (new.rowid, new.title_embeddings, new.body_embeddings);
END;
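The trigger mechanics can be tried with Python's built-in sqlite3 module and plain tables, no extension required. The table and column names below are simplified stand-ins for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (title TEXT, title_embeddings BLOB)")
db.execute("CREATE TABLE posts_shadow (rowid_ref INTEGER, title_embedding BLOB)")

# Same AFTER INSERT pattern as the posts_vss trigger above.
db.execute("""
    CREATE TRIGGER posts_ai AFTER INSERT ON posts
    BEGIN
        INSERT INTO posts_shadow (rowid_ref, title_embedding)
        VALUES (new.rowid, new.title_embeddings);
    END
""")

db.execute("INSERT INTO posts (title, title_embeddings) VALUES (?, ?)",
           ("hello", b"\x00\x01"))
rows = db.execute("SELECT rowid_ref, title_embedding FROM posts_shadow").fetchall()
print(rows)  # the shadow table picked up the new row automatically
```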
We are ready to search using the vector search technique. When the user types in a query, we will create embeddings of the user input using the same encoding algorithm we used for the title and body.
# vector_search.py
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = input("Enter search term: ")
query_embedding = model.encode(query).tolist()
Using the embeddings of the user input we can search the posts_vss table for the closest matches. I decided to run the query from Python, since encoding the search term has to happen in Python anyway. First, pip install the sqlite-vss library.
# vector_search.py
import sqlite3
import sqlite_vss
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
query = input("Enter search term: ")
query_embedding = model.encode(query).tolist()
db = sqlite3.connect("blog.db")
db.enable_load_extension(True)
sqlite_vss.load(db)
stmt = f"""
with body_matches as (
select rowid from posts_vss where vss_search(body_embedding, '{query_embedding}')
limit 5
),
title_matches as (
select rowid from posts_vss where vss_search(title_embedding, '{query_embedding}')
limit 5
)
select distinct posts.id, posts.url, posts.title
from body_matches, title_matches
left join posts on posts.rowid = body_matches.rowid or posts.rowid = title_matches.rowid
"""
results = db.execute(stmt)
print(list(results))
This searches both the title and the body for the five closest matches each and returns the combined, de-duplicated results. Here is a sample output:
$ python vector_search.py
Enter search term: lemon
[
(134, 'http://blog.amjith.com/the-lemonade-stand', 'The Lemonade Stand'),
(116, 'http://blog.amjith.com/orange-juice-with-p-star-star-p', 'Orange Juice with p**p'),
(190, 'https://blog.amjith.com/orange', 'Orange?'),
(35, 'http://blog.amjith.com/shenanigans', 'Shenanigans'),
(118, 'http://blog.amjith.com/chocolate-juice', 'Chocolate Juice'),
(49, 'http://blog.amjith.com/conversations-with-a-4-year-old', 'Conversations with a 4 year old'),
(158, 'http://blog.amjith.com/dinner-and-bsg', 'Dinner and BSG')
]
The results are pretty good.
Datasette
How do we get this to work with Datasette? Datasette has a plugin system that allows us to extend its functionality. The author of sqlite-vss has created a Datasette plugin called datasette-sqlite-vss, which loads the sqlite-vss extension into the SQLite database when Datasette starts.
datasette install datasette-sqlite-vss
The plugin also adds a new SQL function called vss_search. Once the plugin is installed, it is loaded when Datasette starts, and we can use vss_search to query the index.
We are still missing a piece: how do we get the user input from the search box into the SQL query? Datasette's plugin system comes to the rescue again. I wrote a small plugin that converts a user's input string into embeddings using SentenceTransformer.
# vector_encode.py
import json
from datasette import hookimpl
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

@hookimpl
def prepare_connection(conn):
    conn.create_function("vector_encode", 1, vector_encode)

def vector_encode(term):
    embeddings = model.encode(term)
    return json.dumps(embeddings.tolist())
The plugin creates a new SQL function called vector_encode that can be used to encode a string into a vector. Save this in a Python file called vector_encode.py in a folder called plugins.
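Outside Datasette, conn.create_function from the standard sqlite3 module is the same mechanism prepare_connection uses. Here is a sketch with a stub encoder standing in for the real model, so the pattern is testable without downloading anything:

```python
import json
import sqlite3

def fake_encode(term):
    # Stub standing in for model.encode(term).tolist(); returns a tiny fixed-size vector.
    return [float(len(term)), float(term.count(" "))]

def vector_encode(term):
    return json.dumps(fake_encode(term))

conn = sqlite3.connect(":memory:")
# This is what the prepare_connection hook does on every Datasette connection.
conn.create_function("vector_encode", 1, vector_encode)

(result,) = conn.execute("SELECT vector_encode('hello world')").fetchone()
print(result)  # a JSON array, the same shape vss_search expects as a query
```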
datasette blog.db --plugins-dir=plugins/
Now we can use the vector_encode function to encode the user input and the vss_search function to search the index. Here is the SQL query that does the search:
with body_matches as (
select rowid from posts_vss where vss_search(body_embedding, vector_encode(:term))
limit 5
),
title_matches as (
select rowid from posts_vss where vss_search(title_embedding, vector_encode(:term))
limit 5
)
select distinct posts.id, posts.url, posts.title from body_matches, title_matches
left join posts on posts.rowid = body_matches.rowid
or posts.rowid = title_matches.rowid
Visit http://localhost:8001/blog/posts, paste the query into the SQL editor, and click Run SQL. You should see an input box that lets you type in the search term.
I would go a step further and use the canned-query feature in Datasette to make this slightly easier. Create a metadata.yml file:
databases:
  blog:
    queries:
      vector_search:
        sql: |-
          with body_matches as (
            select rowid from posts_vss where vss_search(body_embedding, vector_encode(:term))
            limit 5
          ),
          title_matches as (
            select rowid from posts_vss where vss_search(title_embedding, vector_encode(:term))
            limit 5
          )
          select distinct posts.id, posts.url, posts.title
          from body_matches, title_matches
          left join posts on posts.rowid = body_matches.rowid or posts.rowid = title_matches.rowid
        title: Vector Search
Then relaunch datasette with the metadata file.
datasette blog.db --metadata=metadata.yml --plugins-dir=plugins/
Visit http://localhost:8001/blog and click on the Vector Search query. You should see an input box that lets you type in the search term.
Finally publish it to fly.io using the datasette-publish-fly plugin.
datasette publish fly blog.db --plugins-dir=plugins/ --metadata=metadata.yml \
--app=blog-vector-search \
--install=datasette-sqlite-vss \
--install="'torch<2'" \
--install=sentence-transformers
The additional --install flags are needed to install the dependencies for the plugin that we created to encode the search term.
Unfortunately this does not fit in the free-tier fly.io instances, so I don't have a demo version to show you. But trust me, it is awesome.
Thank you, Alex Garcia and Simon Willison for making these cool projects and writing about them in detail.