
Semantic AI search system using OpenAI and PostgreSQL vector extension

Semantic search allows searching for content based on its meaning, while staying fast and still taking keywords into account.


What you'll need

The central part is the database; we will be using Supabase (a Postgres database with the pgvector extension).

All other tools, including the data provider, can be chosen independently. In our case, we use @supabase/supabase-js for communicating with the database and Blitz.js as the full-stack framework handling both the server side and the frontend of the application.
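Throughout the article we will assume two pre-configured clients: supabase and openAi. A minimal setup sketch (assuming the openai v3 SDK and environment variable names of your choosing) might look like this:

import { createClient } from '@supabase/supabase-js';
import { Configuration, OpenAIApi } from 'openai';

// Hypothetical env variable names - use whatever your project already defines.
export const supabase = createClient(
  process.env.SUPABASE_URL as string,
  process.env.SUPABASE_SERVICE_ROLE_KEY as string,
);

export const openAi = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY }),
);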

Intro

We will be looking at a real-life example that you can already try out and evaluate. We implemented this solution for GitNation, our conference portal, where we collect the content and the people from our conferences all over the world.

Our use case is quite clear: we wanted to give users the ability to conveniently search for content. What's unusual is that our content is video, but that didn't pose an obstacle, and we're now confidently running a full-fledged search over the information provided by our speakers. To picture it, imagine searching for a video on YouTube not by titles and descriptions, but by what the speaker is actually saying inside the video - for every single word!

Getting started

We need to start with content, because after all, we need something to search for in the end :)

In our case, these will be the transcripts of our talks and workshops, extracted from the videos with Whisper (you can read about that in another one of our articles, by the way). But, in fact, this can be absolutely any text, or a product catalog with descriptions. The main thing is that it exists.

Database

To search by content, we need to store it somewhere. To store content, we need to prepare a database. In essence, there are only 2 simple steps:

1. Create a free account on Supabase, which will cover all our needs.
2. Enable the pgvector extension, so that we can use vectors later on.

Once the database is ready, we need to create a table where the data we will be searching will be stored.

CREATE TABLE "SearchIndex" (
id bigserial primary key,
"contentId" int,
text text,
embedding vector(1536),
);

A little explanation about the columns:

  • contentId - a reference to the content (product).
  • text - our content.
  • embedding - the OpenAI embedding vector obtained after processing the text.
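In TypeScript terms, a row of this table can be described roughly like this (illustrative only; the embedding is simply an array of 1536 numbers):

// Rough shape of a "SearchIndex" row.
interface SearchIndexRow {
  id: number;
  contentId: number;
  text: string;
  embedding: number[]; // 1536-dimensional OpenAI embedding
}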

That's it. Next, we just need to populate the table with content.

Content upload

The mechanics are not complex: we need to load our content into the table. If you have a product catalog, it's simple - take the product description and save it to the table. However, we will consider a slightly more complex case, where we have large texts and want to search by individual parts of the content, so that it's possible to navigate to and reference individual chunks.

This can be imagined as a huge text, say, a manual for using your robot vacuum cleaner, or a book.

Fortunately, there's nothing difficult about splitting text into pieces; below we will present our approach. The main thing you need to decide on is the limit for one part - in our case, a segment of about 200 OpenAI tokens.

In this function, we cut the text into sentences, then form chunks from them for loading.

import { encode } from 'gpt-3-encoder';

const CHUNK_LIMIT = 200;
export const CHUNK_MINIMAL = 50;

export type ResultChunk = {
  content: string;
  contentLength: number;
  contentTokens: number;
};

export const chunkTextByTokenLimit = (text: string, limit: number = CHUNK_LIMIT) => {
  const textChunks: string[] = [];

  if (encode(text).length > limit) {
    // Split into sentences and accumulate them until the token limit is reached.
    const splittedText = text.split('. ');
    let chunkText = '';

    splittedText.forEach((sentence) => {
      const sentenceTokenLength = encode(sentence).length;
      const chunkTextTokenLength = encode(chunkText).length;

      if (sentenceTokenLength + chunkTextTokenLength > limit) {
        textChunks.push(chunkText);
        chunkText = '';
      }

      // Restore the full stop that split('. ') removed.
      if (sentence[sentence.length - 1] && sentence[sentence.length - 1].match(/[a-z0-9]/i)) {
        chunkText += `${sentence}. `;
      } else {
        chunkText += `${sentence} `;
      }
    });

    textChunks.push(chunkText.trim());
  } else {
    textChunks.push(text.trim());
  }

  let resultChunks: ResultChunk[] = textChunks.map((chunk) => {
    const trimmedText = chunk.trim();

    return {
      content: trimmedText,
      contentLength: trimmedText.length,
      contentTokens: encode(trimmedText).length,
    };
  });

  if (resultChunks.length > 1) {
    // Merge chunks that ended up too small into the previous chunk.
    resultChunks = resultChunks.filter((chunk, index) => {
      const prevChunk = resultChunks[index - 1];

      if (chunk.contentTokens < CHUNK_MINIMAL && prevChunk) {
        prevChunk.content += ` ${chunk.content}`;
        prevChunk.contentLength = prevChunk.content.length;
        prevChunk.contentTokens += chunk.contentTokens;

        return false;
      }

      return true;
    });
  } else {
    resultChunks = resultChunks.filter(Boolean);
  }

  return resultChunks;
};
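For example, feeding a long transcript into this function produces an array of chunk objects roughly like this (values are illustrative):

const chunks = chunkTextByTokenLimit(longTranscript);
// [
//   { content: 'First sentences of the talk...', contentLength: 812, contentTokens: 196 },
//   { content: 'The next part of the talk...', contentLength: 790, contentTokens: 189 },
//   ...
// ]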

After we have created a function to divide the content into chunks, all that remains is to upload them into the table.

import { encode } from 'gpt-3-encoder';
import { chunk } from 'lodash';

// `supabase` and `openAi` are the pre-configured clients, `Logger` is the project's logger;
// chunkTextByTokenLimit and ResultChunk come from the chunking module above.

const createEmbedding = async (input) => {
  const inputFormatted = input.toLowerCase().replace(/\n/g, ' ');

  const embeddingRes = await openAi.createEmbedding({
    model: 'text-embedding-ada-002',
    input: inputFormatted,
  });

  const [{ embedding }] = embeddingRes.data.data;
  return embedding;
};

const processCallBacks = async (promisesChunks, callback, contentId) => {
  // Process the batches sequentially, but the chunks inside a batch in parallel.
  for (const batch of promisesChunks) {
    await Promise.all(
      batch.map(async (data) => {
        try {
          let embed;

          // Make sure we stay under the OpenAI input limit;
          // otherwise embed only a slice of the text.
          if (encode(data.content).length < 1500) {
            embed = await createEmbedding(data.content);
          } else {
            embed = await createEmbedding(data.content.slice(0, 800));
          }

          await callback(data, embed);
        } catch (error) {
          Logger.error(`Text processing error. id: ${contentId}`, error);
          throw error;
        }
      }),
    );
  }
};

export const createContentEmbeds = async (content) => {
  const contentTextChunks = chunkTextByTokenLimit(content.transcription);

  // Group the chunks into batches of 2 with Lodash's _.chunk() to limit concurrency.
  const promisesTextChunks: ResultChunk[][] = chunk(contentTextChunks, 2);

  await processCallBacks(
    promisesTextChunks,
    (data, embed) => {
      return supabase
        .from('SearchIndex')
        .insert({
          contentId: content.id,
          text: data.content,
          embedding: embed,
        })
        .then(({ error }) => {
          if (error) {
            throw error;
          }
        });
    },
    content.id,
  );

  return 'Done';
};

After you review these functions, you should have an idea of how to load the data; essentially, it all boils down to mapping your content with something like createContentEmbeds.
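A sketch of such a mapping, assuming a hypothetical loader that returns your content records with id and transcription fields:

const contents = await getAllContents(); // hypothetical: load your content records
for (const content of contents) {
  await createContentEmbeds(content);
}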

First queries

Congratulations! Now that we have the content, we can, in principle, perform the search. Let's give it a try!

In order to compare the user's query with the vectors we created in the previous step, we, unsurprisingly, also need to convert it into a vector. It's straightforward - just call the same method, and that's it.

export const semanticSearch = async (searchTerm) => {
  const embedding = await createEmbedding(searchTerm);
};

Now, we will need to create a Postgres function through which we will perform the search.

CREATE OR REPLACE FUNCTION semantic_search (
  query_embedding VECTOR(1536),
  match_threshold FLOAT,
  match_count INT
)
RETURNS TABLE (
  id BIGINT,
  "contentId" INT,
  similarity FLOAT
)
LANGUAGE sql STABLE
AS $$
  SELECT
    "SearchIndex".id,
    "SearchIndex"."contentId",
    1 - ("SearchIndex".embedding <=> query_embedding) AS similarity
  FROM
    "SearchIndex"
  WHERE
    1 - ("SearchIndex".embedding <=> query_embedding) > match_threshold
  ORDER BY
    similarity DESC
  LIMIT
    match_count;
$$;

And now, you can call it in your application, adding the necessary variables to it.

export const semanticSearch = async (searchTerm, elementsCount: number = 5000) => {
  const embedding = await createEmbedding(searchTerm);

  const { data: results, error } = await supabase.rpc('semantic_search', {
    query_embedding: embedding,
    match_threshold: 0.78,
    match_count: elementsCount,
  });

  if (error) {
    throw error;
  }

  return results;
};
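Note that the function returns only SearchIndex ids, contentIds, and similarity scores, so the application still has to map them back to the actual content records. A minimal sketch, assuming a hypothetical Content table keyed by id:

const results = await semanticSearch('awesome AI features');
const contentIds = results.map((row) => row.contentId);

// 'Content' is a hypothetical table holding the actual talks/products.
const { data: contents } = await supabase
  .from('Content')
  .select('*')
  .in('id', contentIds);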

Finally, it's time for images! For the query "awesome AI features," we get the following results.

search results (screenshot)

The results are absolutely relevant and align with expectations at this stage! Overall, we could have stopped here if the execution of the function didn't take 6 seconds. Additionally, it's worth noting that sometimes the meaning perceived by OpenAI may not completely align with what the user had in mind. For instance, when we ask about something like "great react", it might interpret it as a question about people's reactions to something. This is more our problem than OpenAI's :), as we know that it lacks context.

Optimisation

As mentioned earlier, the current query takes 6 seconds. In this part, we will explore how to reduce it to a few hundred milliseconds!

We propose the following approach. First, it would be beneficial to reduce the number of rows for which we need to perform the search. Since this has a linear relationship, cutting it in half will double the search speed. There are various ways to achieve this, but we will consider one that may not be the fastest, yet will bring not only an increase in search speed but also an improvement in its functionality. We will add a step before searching through vectors, using full text search, which, by the way, is also vector-based and is already implemented in Postgres by default.

This way, we gain several benefits. We combine classic search with semantic search, taking the best from both approaches, and simultaneously speeding up the semantic search itself. You can read more about the implementation of Postgres full-text search in detail here. However, we will briefly go through the steps.

DB optimisation

We will add a new column named 'fts' to the 'SearchIndex' table, which will store the searchable tsvector index built from the 'text' column.

ALTER TABLE "SearchIndex"
ADD COLUMN fts tsvector GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;

It should be noted that the text search itself also requires optimization, so we will do it right away. Specifically, we will create an index on this column, which significantly increases query performance. (In our case, the query time went down from 6 to 0.2 seconds.)

CREATE INDEX "SearchIndex" ON text USING GIN (fts);

After this, you can already test the full-text search using an SQL query. I would like to emphasize that the syntax for this search is unusual. You can read more about it in the documentation. For example, the query below will search for records that contain any word from the phrase.

SELECT * FROM "SearchIndex" WHERE fts @@ to_tsquery('awesome | AI | features');

After this, we need to create an index for our OpenAI vectors as well, because we want to have the same advantages here too. It is essential to read about indexing in the documentation (link), as the index parameters directly impact how much of the index has to be scanned, i.e., the overall performance.

CREATE INDEX ON "SearchIndex" USING ivfflat (embedding vector_cosine_ops) WITH (lists = 7);

Great! We are ready to update our query and check the results.

Query optimisation

Since we are using a function, we will need to update it specifically, or alternatively, create a new one.

CREATE OR REPLACE FUNCTION semantic_search (
  query_string text,
  query_embedding vector(1536),
  similarity_threshold float,
  match_count integer
)
RETURNS TABLE (
  id bigint,
  "contentId" integer,
  similarity double precision
)
LANGUAGE plpgsql
AS $$
BEGIN
  RETURN QUERY
  WITH filtered_content AS (
    SELECT * FROM "SearchIndex" WHERE fts @@ to_tsquery(query_string)
  )
  SELECT
    filtered_content.id,
    filtered_content."contentId",
    1 - (filtered_content.embedding <=> query_embedding) AS similarity
  FROM filtered_content
  WHERE 1 - (filtered_content.embedding <=> query_embedding) > similarity_threshold
  ORDER BY similarity DESC
  LIMIT match_count;
END;
$$;

Result query

Alright, now all that's left is to call the function and see what we've got.

export const semanticSearch = async (searchTerm, elementsCount: number = 5000) => {
  const embedding = await createEmbedding(searchTerm);
  // Turn "awesome AI features" into "awesome | AI | features" for to_tsquery.
  const pgQuery = searchTerm.toLowerCase().replace(/\s+/g, ' | ');

  const { data: results, error } = await supabase.rpc('semantic_search', {
    query_string: pgQuery,
    query_embedding: embedding,
    similarity_threshold: 0.78,
    match_count: elementsCount,
  });

  if (error) {
    throw error;
  }

  return results;
};

Great! The results are completely relevant. The exact ranking might be debatable, since the same presentations now appear in slightly different positions. However, the execution speed has improved dramatically, going from 6 seconds down to 512ms, which is roughly 12 times faster!

search results (screenshot)

Conclusions

Alright, let's take a look at what we have accomplished. We have developed a search function that combines two approaches: keyword-based search and semantic search. It works swiftly and essentially comes at no cost to you.

Below, I will leave you with the essential links that you should read if you want to fine-tune everything to suit your needs.

Remember that you can already check out the production version on the GitNation portal.

Thank you for your time! All the best!

Links
